XML data projection (or pruning) is a natural optimization for main
memory query engines: given a query Q over a document D, the subtrees
of D that are not necessary to evaluate Q are pruned, thus producing
a smaller document D'; the query Q is then executed on D', which
avoids allocating and processing nodes that will never be reached by Q.
In this article, we propose a new approach, based on types, that greatly
improves current solutions. Besides providing comparable or greater
precision and far lower pruning overhead, our solution, unlike current
approaches, takes into account backward axes and predicates, and can be
applied to multiple queries rather than just to single ones. A side
contribution is a new type system for XPath able to handle backward axes.
The soundness of our approach is formally proved. Furthermore, we
prove that the approach is also complete (i.e., yields the best possible
type-driven pruning) for a relevant class of queries and schemas. We
further validate our approach using the XMark and XPathMark
benchmarks and show that pruning not only improves the performance of
main memory query engines (as expected) but also that of state-of-the-art
native XML databases.

1 Introduction 

Main-memory XML query engines are often the primary choice for applications that do not 
wish or cannot afford to build secondary storage indexes or load a database before query 
processing. One of the main optimisation techniques recently adopted in this context is 
XML data projection (or pruning) [27, 13].

The basic idea behind document projection is at once very simple and powerful.
Given a query Q over a document D, sub-trees of D that are not necessary to evaluate
Q are pruned, thus yielding a smaller document D'. Then Q is executed over D', which
avoids allocating and processing nodes that will never be reached by the navigational
specifications in Q. This ensures that evaluation over D' is equivalent to, and more
efficient than, evaluation over D.

As shown in [27, 13], XML navigation specifications expressed in queries tend to be
very selective, especially in terms of document structure. Therefore, pruning may yield
significant improvements both in terms of execution time and in terms of memory usage:
as a matter of fact, for main-memory XML query engines, very large documents cannot
be queried without pruning.

1.1 State of the art 

Marian and Simeon [27] propose that the actual data needs of an XQuery query Q (that is,
the part of the data that is necessary to the execution of the query) be determined by
statically extracting all paths in Q. These paths are then applied to D at load time, in a
SAX-event-based fashion, in order to prune unneeded parts of the data. The technique is
powerful since: (i) it applies to most of the XQuery core, (ii) it can be applied to a set of
queries over the same document, and (iii) it does not require any a priori knowledge of the
structure of D. However, this technique suffers from some limitations. First, the document
loader-pruner can manage neither backward axes nor path expressions with predicates
(sometimes called "qualifiers"), even though both, and especially the latter, can contain
precious information to optimise pruning. Second, the advantage described in point (iii)
becomes a big drawback when "//" occurs in paths since, in that case, the technique does
not behave efficiently in terms of loading time and pruning precision (hence, memory
allocation). Indeed, when a // is present in a projection path, the pruning process has to
visit all descendants of a node in order to decide whether the node contains a useful
descendant. What is worse, pruning time tends to be quite high and it drastically increases
(together with memory consumption) when the number of // steps in the pruning path-set
grows. As a matter of fact, in this technique, pruning corresponds to computing a further
query whose time and memory occupation may be comparable to those required to compute
the original query. In particular, in this technique every occurrence of // may yield a full
exploration of the tree (e.g., see in [27] the test for the XMark [32] query Q7, which only
contains three // steps and for which just computing the pruning takes longer than
executing the query on the original document). Therefore, pruning execution overhead and
its high memory footprint may jeopardise the gains obtained by using the pruned document.
Third and finally, as we explain in Section 7, the precision of pruning drastically degrades
(down to being nullified) for queries containing the XPath expressions
descendant::node[cond], which are very useful and used in practice.

Bressan et al. [13] introduce a different and quite precise XML pruning technique for a
subset of XQuery FLWR expressions. The technique is based on the a priori knowledge of
a data-guide for D. The document D is first matched against an abstract representation of
Q. Pruning is then performed at run time; it is very precise and, thanks to the use of some
indexes over the data-guide, it ensures good improvements in terms of query execution
time. However, the technique is one-query oriented (in the sense that it cannot be applied
to multiple queries), it does not handle XPath predicates, and cannot handle backward
axes (recall that the encodings of OTI are defined for XPath, and no extension to XQuery- 
like languages is known). Also, the approach requires the construction and management of 
the data-guide and of adequate indexes. 

Motivated by efficient XML stream processing, Green et al. [23] introduced a framework
for discarding sequences of SAX events in an XML data stream. Although their
approach allows them to prune an input stream with respect to sets of queries, the language 
they handle is restricted to forward linear XPath expressions (that is, XPath expressions 
with only child and descendant axes and without predicates). 

1.2 Our contribution 

In this article, we present a new pruning approach that is applicable in the presence of 
typed XML data. This is often the case, as most applications require that data are valid 
with respect to some external schema (e.g., DTD [19] or XML Schema [37]).

Our technique combines the advantages of the previously mentioned works while relaxing
their limitations. Unlike [27, 13, 23], our approach accounts for backward axes, performs
a fine-grained analysis of predicates, allows (unlike [13]) for dealing with sets of
queries, and (unlike [27]) cannot be jeopardised by pruning overhead. Our solution
provides in all cases comparable or greater precision than the other approaches, while
always requiring negligible or no pruning overhead. Moreover, contrary to [27, 13], our
approach is formally proved to be sound (i.e., pruning does not alter the result of queries)
and, furthermore, we can also prove it to be complete (i.e., it produces the best possible
type-driven pruning) for a substantial class of queries and DTDs.

For the sake of presentation we introduce our framework in three steps. In the first
step, we consider a simplified version of XPath, dubbed XPathℓ, which includes only
upward/downward axes and unnested disjunctive predicates. We define for XPathℓ a static
analysis that determines a set of type names, a type projector, that is then used to prune the
document(s). One of the particular features of this approach is that our pruning algorithm is
characterised by a constant (and low) memory consumption and by an execution time linear
in the size of the document to prune. More precisely, pruning based on type projectors
is equivalent to a single buffer-less one-pass traversal of the parsed document (it simply
discards elements not generated by any of the names in the projector). So, if embedded in
query processors, pruning can be executed during parsing and/or validation and brings no
overhead at all, while if used as an external tool it requires a time always smaller than or
equal to the time used to parse the queried document. Soundness and (partial) completeness
results for the static analysis are stated.
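
As a purely illustrative sketch (it is not the implementation discussed later in the article), the buffer-less one-pass traversal described above can be mimicked with a SAX handler. The sketch assumes a DTD-like type in which every element tag corresponds to exactly one type name, so that a projector can be given simply as a set of tags. The names `ProjectorPruner` and `prune` are hypothetical, attributes are ignored, and text is kept whenever its enclosing element is kept.

```python
import xml.sax
from xml.sax.saxutils import escape


class ProjectorPruner(xml.sax.ContentHandler):
    """Copies SAX events to the output, skipping subtrees whose tag is not kept."""

    def __init__(self, kept_tags):
        super().__init__()
        self.kept_tags = kept_tags  # assumed stand-in for the names of a type projector
        self.skip_depth = 0         # > 0 while we are inside a discarded subtree
        self.out = []

    def startElement(self, name, attrs):
        if self.skip_depth or name not in self.kept_tags:
            self.skip_depth += 1    # discard this element and everything below it
            return
        self.out.append(f"<{name}>")

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{name}>")

    def characters(self, content):
        if not self.skip_depth:
            self.out.append(escape(content))


def prune(xml_text, kept_tags):
    """Single pass over the document; only a depth counter is kept in memory."""
    handler = ProjectorPruner(set(kept_tags))
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.out)
```

The only state carried across events is one counter, which is what makes the traversal constant in memory and linear in the size of the input in this simplified setting.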

The second step consists of extending the analysis to the whole XPath (more precisely,
to XPath 1.0), that is, we need to show how to deal with missing axes and with general
predicates as defined in the XPath specification. This is done by associating to each XPath
query Q an XPathℓ query P that soundly approximates Q, in the sense that the projector
inferred for P by the static analysis developed at the first step is also a sound projector
for Q.

The final step of our process is to extend the approach to XQuery (hence, to XPath 2.0).
This is obtained in the same way as done in [27], by defining a path extraction algorithm.
Our path extraction algorithm improves and extends in several aspects (in particular, in
terms of the selectivity of the extracted paths) the one of [27]. It also computes the XPathℓ
approximation of the extracted paths so that the static analysis of the first step can be
directly applied to them.

We prove some important closure properties that guarantee that type projections can
always be performed at load time during the validation process, without any overhead.
In particular, for XML documents typed with DTDs or XML Schemas the document
can be pruned in a streaming fashion.

We gauged and validated our approach by testing it both on the XPathMark ETl and on 
the XMark [32] benchmarks. The result of this validation confirmed what was expected:
thanks to the handling of backward axes and of predicates the precision of our pruning is
in general noticeably higher than that of current approaches; the pruning time is linear in
the size of the queried document and has a very low memory footprint; the time of the
static analysis is always negligible (lower than half a second on the hardware we used for
our benchmarks, described in Section 9) even for complex queries and DTDs. But
benchmarks also brought unexpected (and quite pleasant) results. In particular, they showed
that type-based pruning brings benefits that go beyond those of the reduced size of the pruned
document: by excluding a whole set of data structures (those whose type names are not
included in the type projector), the pruning may drastically reduce the resources that must
be allocated at run-time by the query processor. For instance, our benchmarks show that
for several XMark and XPathMark queries our pruning yields a document whose size is
two thirds of the size of the original document, but the query can then be processed using
three times less memory than when processed on the original document. This is a very
important gain, especially for DOM-based processors, or memory-sensitive processors. Not
only is our approach relevant in the case of main memory query engines such as Saxon, but
it is also shown to be useful for native query engines as efficient as MonetDB [12]. Even in
the latter case our experiments demonstrate the relevance of type projection as a
complementary optimisation technique. Indeed, this is not totally surprising, as type
projection can be thought of as a way of defining clustering policies along the same lines as
what was done in the context of object-oriented databases [8, 4, 7]. Clustering and indexing
are well-known complementary tools used in the context of query optimisation.

As an aside we want to stress that our technique relies on the definition of a new type
system for XPath able to handle backward axes which, on its own, constitutes a
contribution of this work. In particular the precision of type inference for backward axes
goes beyond what is proposed in the XQuery Static Semantics recommendation [18].

Finally, we presented a preliminary version of this work at the VLDB 2006 conference [5].
The work in this article, besides including full proofs and having been cleaned
up, improves and extends the work in [5] in several important aspects. First and foremost,
we generalized the definition of type projectors by using as projectors sets of production
rules (as opposed to the sets of non-terminals used in [5]) of regular tree grammars (as
opposed to the DTDs used in [5]). This generalization was far from being straightforward.
In particular, we had to prove the applicability of our technique to the more general
framework under consideration (cf. Section 3.2). However, the result is worth the effort
since the advantages of this generalization are twofold. On the one hand, using regular tree
grammars allows us to compute type projectors for every kind of XML schema formalism
we are aware of, such as DTDs, XML Schemas, CDuce and XDuce types, Relax Core
and TREX schemas. On the other hand, inferring grammar production rules rather
than grammar non-terminals allows us to compute context-aware and, thus, more precise
projectors. More precisely, the new type projectors introduced in this work can prune a
subtree not only based on its tag (as was done in [5]), but also on any structural condition
expressible by a regular tree language. So, for instance, our pruning process may decide
to prune just one of two trees generated by the same non-terminal, because they appear
in different contexts (in [5] either both trees were pruned or both were preserved).
Therefore these new projectors are both more general and can perform much finer-grained
pruning. Second, although we develop the theory of type projection for a simplified data
model and restricted forms of XPath expressions, we thoroughly detail how to tackle many
of the peculiarities of the XML and XPath specifications [34, 35], including the handling of
attributes, the presence of absolute axes in XPath predicates, and a wide range of predefined
XPath functions (all absent in [5]). The path language we formally study extends the one
in [5] with top-level unions of paths, predicate conjunctions ("and") and arbitrarily nested
predicates (our previous work formally treated only non-nested predicates and resorted to
an approximation in the case of nested predicates). Third, we provide an extensive list of
experiments showing the overall benefits of type projection for a wide range of queries and
query engines. These experiments supersede the early benchmarks realised in [5] and show
that, despite the advances in XML query technologies in recent years, our static analysis
can significantly improve the performance (both in time and memory consumption) of
many different XML query engines.

1.3 Plan of the article 

The article is organised as follows. Section 2 introduces basic definitions and notations:
data model, types, validation. Section 3 presents type projectors, type-based projection, and
several theoretical (closure) properties. In Section 4 we define XPathℓ and its semantics,
and formally describe how general XPath predicates can be soundly approximated in it. In
Section 5 we present our type projector inference algorithm for XPathℓ and state its formal
properties. In Section 6 we extend our approach to full XPath and in Section 7 to XQuery.
In Section 8 we discuss how to apply our technique to other typing policies as well as to
untyped documents. Section 9 presents our implementation and reports the results of our
benchmarks. We finally conclude in Section 10 by presenting the perspectives of this work.
Last, for the sake of clarity, all the proofs for the stated results are given in Appendix A.

2 Notations



2.1 Data Model 



For the sake of concision and clarity we present our solution for a simplified version of the 
XQuery data model where we do not consider node attributes. However, attributes are fully 
supported in our implementation through a trivial encoding, documented in Section 6. An
instance of the XQuery data model can then be generated by the following grammar: 

Definition 2.1 (Data model)

    Tree    $t ::= s_i \mid l_i[f]$
    Forest  $f ::= () \mid f, f \mid t$        □

Essentially, an instance of the XQuery data model is an ordered sequence of labelled
ordered trees (ranged over by t), that is, an ordered forest (ranged over by f), where
each node has a unique identifier (ranged over by i) and where () denotes the empty forest.
Tree nodes are labelled by element tags (ranged over by l) while, without loss of generality,
we consider only leaves that are text nodes (that is, strings, ranged over by s) or empty trees
(that is, elements that label the empty forest).

We define a complete partial order $\preceq$ on forests (and thus on trees) by relating a forest
with the forests obtained either by adding or by deleting subforests:

Definition 2.2 (Projection ($\preceq$)) Given two forests f and f', we say that f' is a projection
of f, noted as $f' \preceq f$, if f' is obtained by replacing some subforests of f by the empty forest.
In other terms, $\preceq$ is the smallest pre-congruence on forests that contains $() \preceq f$ for all f. □

We also define a notion of good formation, with respect to the data model given in
Definition 2.1.

Definition 2.3 (Good formation) A forest is well formed if every identifier i occurs in it
at most once. Given a well-formed forest f and an identifier i occurring in it, we denote by
f@i the unique subtree t of f such that $t = s_i$ or $t = l_i[f']$. The set of identifiers of a forest
f is then defined as $\mathit{Ids}(f) = \{ i \mid \exists t.\ f@i = t \}$. □

Henceforth we will consider only well-formed forests and identify a node with its
identifier.

Definition 2.4 (Root id) Let t be a tree. If $t = s_i$ or $t = l_i[f]$, we define RootId(t) = i.

2.2 Types and validation

In this work, we present our approach for an abstract model of types, namely regular tree
grammars. It is well known that regular tree grammars encompass most of the features
of well established and standardized schema specifications such as DTDs, XML Schemas,
RelaxNG definitions, and XDuce and CDuce's regular expression types. This is for instance
documented in [29], from which we borrow the definition of regular tree grammar:

Definition 2.5 (Regular tree grammar) A regular tree grammar is a pair $(\mathcal{S}, E)$ where
$\mathcal{S}$ is a set of distinguished names (actually, non-terminal meta-variables) and E is a set of
production rules of the form $\{X_1 \to R_1, \ldots, X_n \to R_n\}$ such that:

1. each $R_i$ is either the terminal String (denoting string content), or the terminal
Any (denoting any tree), or $l[r]$ where l ranges over valid element names and
r is a regular expression on the non-terminal symbols $X_1, \ldots, X_n$, that is:

    RegExp  $r ::= \varepsilon$        (empty sequence)
            $\mid\ r\ r$              (sequence)
            $\mid\ r \mid r$          (alternation)
            $\mid\ r^*$               (Kleene star)
            $\mid\ X$                 (non-terminal)

(henceforth, we use $r^+$ for $r\ r^*$ and $r?$ for $\varepsilon \mid r$);

2. $\mathcal{S} \subseteq \{X_1, \ldots, X_n\}$ is the set of start symbols;

3. for any two production rules with the same left hand side $X_i \to l[r]$ and $X_i \to l'[r']$,
we have $l \neq l'$. □

The intuition is that a regular tree grammar describes (i.e., it "types") a set of trees of the
data model. Notice that the left-hand sides of the rules in E do not need to be pairwise
distinct. Indeed, production rules such as $X \to R_1$, $X \to R_2$ are necessary if one wants to
encode complex schemas. Furthermore, given a regular tree grammar, it is always possible
to equivalently rewrite it so that condition 3 holds: if there are two rules $X_i \to l[r]$ and
$X_i \to l[r']$ then they can be merged into a single rule, $X_i \to l[r \mid r']$.

Definition 2.6 (Names of a regular expression) Given a regular expression r we denote
by Names(r) the set of non-terminals occurring in it, namely:

    $\mathit{Names}(\varepsilon) = \emptyset$
    $\mathit{Names}(r_1\ r_2) = \mathit{Names}(r_1) \cup \mathit{Names}(r_2)$
    $\mathit{Names}(r_1 \mid r_2) = \mathit{Names}(r_1) \cup \mathit{Names}(r_2)$
    $\mathit{Names}(r^*) = \mathit{Names}(r)$
    $\mathit{Names}(X) = \{X\}$        □

By extension, given a set $E = \{X_1 \to R_1, \ldots, X_n \to R_n\}$, we define

    $\mathit{Names}(E) = \bigcup_{j=1}^{n} \mathit{Names}(R_j)$

and Dn(E) for the set of names defined in E (that is, $\{X_1, \ldots, X_n\}$). While for all types
$(\mathcal{S}, E)$ we have Names(E) = Dn(E), we handle incomplete sets of rules during the
formalisation of the algorithms, whence the need for both notations. We also say that r is a
regular expression over $(\mathcal{S}, E)$ if r is a regular expression over names in Dn(E). We denote
by $\mathcal{L}(r)$ the language recognized by the regular expression r. We use W, X, Y, Z to
range over names. We use Greek letters to range over sets of rules: just as $(\mathcal{S}, E)$ represents
a regular tree grammar, we shall use $\pi$ to stress that the set of rules is a type projector (cf.
Definition 3.1) and $\kappa$ and $\tau$ to stress that the set is used as a context or as a type,
respectively (cf. Section 5). Last, we shall use S to range over sets of (node) identifiers.

We illustrate the syntax of regular tree grammars with the following example:

Example 2.7 (A regular tree grammar for the bibliography DTD) The well-known
bibliography DTD (taken from the XML Query use cases [15]) can be written as a regular tree
grammar $(\{X\}, E)$, with unique start symbol X and the following set E of rules:

    $X \to \texttt{bib}[\mathit{Book}^*]$
    $\mathit{Book} \to \texttt{book}[\mathit{Title}, (\mathit{Author}^+ \mid \mathit{Editor}^+), \mathit{Publ}]$
    $\mathit{Title} \to \texttt{title}[\mathit{String}]$
    $\mathit{Author} \to \texttt{author}[\mathit{String}]$
    $\mathit{Editor} \to \texttt{editor}[\mathit{String}]$
    $\mathit{Publ} \to \texttt{publisher}[\mathit{String}]$

This regular tree grammar "types" all XML documents (i.e., trees of the data model) that
are rooted in a bib element that contains a possibly empty list of book elements, each one
containing a list starting with a title element containing a string, followed by a non-empty
homogeneous list formed either by author elements or editor elements, and ended by a
publisher element.
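
To make the notation concrete, the grammar of Example 2.7 and the functions Names and Dn can be transcribed as in the following sketch. The tuple encoding of regular expressions and all function names are hypothetical choices made for illustration only. Since Definition 2.5 allows only non-terminals inside l[r], the sketch also assumes a helper non-terminal `Str` with rule Str → String, which the example leaves implicit by writing String directly inside the content models.

```python
from functools import reduce

# Regular expressions over non-terminals as nested tuples (a hypothetical encoding):
#   ("eps",) | ("seq", r1, r2) | ("alt", r1, r2) | ("star", r) | ("name", X)
def name(x):
    return ("name", x)

def seq(first, *rest):
    return reduce(lambda acc, r: ("seq", acc, r), rest, first)

def alt(left, right):
    return ("alt", left, right)

def star(r):
    return ("star", r)

def plus(r):  # r+ abbreviates r r*
    return seq(r, star(r))

# A rule is a pair (lhs, rhs) where rhs is "String", "Any", or (label, regex).
BIB_RULES = [
    ("X", ("bib", star(name("Book")))),
    ("Book", ("book", seq(name("Title"),
                          alt(plus(name("Author")), plus(name("Editor"))),
                          name("Publ")))),
    ("Title", ("title", name("Str"))),
    ("Author", ("author", name("Str"))),
    ("Editor", ("editor", name("Str"))),
    ("Publ", ("publisher", name("Str"))),
    ("Str", "String"),  # assumed helper rule standing for the String terminal
]

def names_of(r):
    """Names(r): the non-terminals occurring in a regular expression."""
    tag = r[0]
    if tag == "eps":
        return set()
    if tag == "name":
        return {r[1]}
    if tag == "star":
        return names_of(r[1])
    return names_of(r[1]) | names_of(r[2])  # "seq" and "alt"

def names_of_rules(rules):
    """Names(E): union of Names(r) over the rules of the form X -> l[r]."""
    return set().union(*(names_of(rhs[1]) for _, rhs in rules if isinstance(rhs, tuple)))

def defined_names(rules):
    """Dn(E): the left-hand sides of the rules."""
    return {lhs for lhs, _ in rules}
```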

The concept of typing an XML document by a regular tree grammar is formalized by
the notion of validity defined as follows:

Definition 2.8 (Valid Trees) A tree t is valid with respect to a type $(\mathcal{S}, E)$ if there exists
a mapping (interpretation) $\mathcal{I}$ from Ids(t) to Names(E) such that:

1. $\mathcal{I}(\mathit{RootId}(t)) \in \mathcal{S}$;

2. for each i in Ids(t), if $t@i = s_i$, then either $\mathcal{I}(i) \to \mathit{Any} \in E$ or $\mathcal{I}(i) \to \mathit{String} \in E$;

3. for each i in Ids(t), if $t@i = l_i[t_1, \ldots, t_n]$, then either we have $\mathcal{I}(i) \to \mathit{Any} \in E$ or we
have $\mathcal{I}(i) \to l[r] \in E$ and $\mathcal{I}(\mathit{RootId}(t_1)), \ldots, \mathcal{I}(\mathit{RootId}(t_n)) \in \mathcal{L}(r)$.

In this case we say that t is $\mathcal{I}$-valid with respect to $(\mathcal{S}, E)$ and write $t \in_{\mathcal{I}} (\mathcal{S}, E)$ to indicate
it. □

For instance the following tree (in which we omit the node identifiers)

    bib[
      book[
        title["Divina Commedia"],
        author["Dante"],
        publisher["Ludovico Dolce"]
      ]
    ]

is valid with respect to the type $(\{X\}, E)$ defined in Example 2.7. There exist various
techniques and algorithms to validate XML trees against regular tree grammars (for instance,
by using tree automata: cf. Algorithm 4.4 in [29]). Note however that, due to our use of
regular tree grammars, the interpretation $\mathcal{I}$ might not be unique and that a validating
algorithm will generate, for a document t and a type $(\mathcal{S}, E)$, one possible interpretation
$\mathcal{I}$ such that t is $\mathcal{I}$-valid with respect to $(\mathcal{S}, E)$.
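
The following sketch illustrates Definition 2.8 on the tree above. It is not the tree-automata algorithm of [29]: it assumes a local grammar, in which an element label determines its non-terminal, so that the interpretation can be built in one top-down pass. Content models are checked by translating them into Python regular expressions over space-terminated names. The tree encoding (a string for a text leaf, a pair (label, children) for an element), the preorder numbering used as identifiers, the helper non-terminal `Str` and all function names are hypothetical.

```python
import itertools
import re

def N(x):
    return ("name", x)

def plus(r):
    return ("seq", r, ("star", r))

STR = "Str"  # assumed non-terminal with rule Str -> String

# Local grammar: each element label determines its non-terminal and content model.
RULES = {
    "bib": ("X", ("star", N("Book"))),
    "book": ("Book", ("seq", ("seq", N("Title"),
                              ("alt", plus(N("Author")), plus(N("Editor")))),
                      N("Publ"))),
    "title": ("Title", N(STR)),
    "author": ("Author", N(STR)),
    "editor": ("Editor", N(STR)),
    "publisher": ("Publ", N(STR)),
}

def to_pattern(r):
    """Translate a regex over names into a Python regex over 'Name ' tokens."""
    tag = r[0]
    if tag == "eps":
        return ""
    if tag == "name":
        return "(?:" + re.escape(r[1]) + " )"
    if tag == "star":
        return "(?:" + to_pattern(r[1]) + ")*"
    if tag == "seq":
        return to_pattern(r[1]) + to_pattern(r[2])
    return "(?:" + to_pattern(r[1]) + "|" + to_pattern(r[2]) + ")"  # "alt"

def validate(tree, rules, start_names):
    """Return an interpretation {identifier: name}, or raise ValueError."""
    interp, ids = {}, itertools.count()

    def visit(t):
        i = next(ids)  # identifiers are assigned in document (preorder) order
        if isinstance(t, str):
            interp[i] = STR
            return STR
        label, children = t
        if label not in rules:
            raise ValueError(f"no rule for element {label!r}")
        nt, content = rules[label]
        interp[i] = nt
        word = "".join(visit(c) + " " for c in children)
        if not re.fullmatch(to_pattern(content), word):
            raise ValueError(f"children of {label!r} do not match its content model")
        return nt

    if visit(tree) not in start_names:
        raise ValueError("root is not typed by a start symbol")
    return interp
```

For general regular tree grammars the name of a node is not determined by its label alone, which is why a validator may have to choose among several interpretations; the sketch sidesteps this by restricting itself to local grammars.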

Given a tree t valid with respect to a type $(\mathcal{S}, E)$, we can use subsets of E to project
that tree. Essentially, from the rules in E we compute another set of "simpler" rules which
denotes only the nodes to be kept. In order to define this notion formally we need to define
the reachability relation $\Rightarrow_E$, which we introduce below together with several other
definitions that we use later in the article.

Definition 2.9 (Forward Reachability) Given a type $(\mathcal{S}, E)$ and $Z \in \mathit{Dn}(E)$, we write
$Z \Rightarrow_E Y$ if and only if $Z \to R \in E$ and $Y \in \mathit{Names}(R)$. We use $\Rightarrow_E^+$ and $\Rightarrow_E^*$ to denote
respectively the transitive closure and the transitive and reflexive closure of $\Rightarrow_E$. □

Strings of names are called chains and ranged over by $c, c_i, c', \ldots$ In particular we use
$\mathit{Chains}_{(\mathcal{S},E)}(Y)$ to denote the set of all chains rooted at Y, defined as
$\{ Y X_1 \ldots X_n \mid Y \Rightarrow_E X_1 \Rightarrow_E \ldots \Rightarrow_E X_n,\ n \geq 0 \}$. We use Names(c) to denote the set of
all names occurring in a chain c.
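
As an illustration, reachability and chains can be computed as in the sketch below for the bibliography grammar of Example 2.7. The grammar is assumed to be given directly as a map from each name to the names occurring in its right-hand side (with the hypothetical helper name `Str` for string content); the function names and the length bound on chains, which is needed because recursive grammars have infinitely many chains, are choices made for this sketch only.

```python
# Z => Y holds iff Y is in STEP[Z]; this is Names(R) for the rule Z -> R.
STEP = {
    "X": {"Book"},
    "Book": {"Title", "Author", "Editor", "Publ"},
    "Title": {"Str"},
    "Author": {"Str"},
    "Editor": {"Str"},
    "Publ": {"Str"},
    "Str": set(),
}

def reach_plus(step, z):
    """Names reachable from z in one or more steps (transitive closure)."""
    seen, todo = set(), list(step[z])
    while todo:
        y = todo.pop()
        if y not in seen:
            seen.add(y)
            todo.extend(step[y])
    return seen

def reach_star(step, z):
    """Names reachable from z in zero or more steps (reflexive transitive closure)."""
    return {z} | reach_plus(step, z)

def chains(step, y, max_len):
    """All chains rooted at y that contain at most max_len names."""
    result = [(y,)]
    if max_len > 1:
        for x in sorted(step[y]):
            result += [(y,) + c for c in chains(step, x, max_len - 1)]
    return result
```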

At the beginning of the section we defined the projection of a forest as a forest obtained
by replacing some subforests by the empty forest. Here we define an analogous concept for
types, called erasure, according to which a type is obtained from another by replacing some
non-terminals by the empty regular expression.

Definition 2.10 (Erasure of a regular expression) Let r be a regular expression and N a
set of names. We define the erasure of r with respect to N, noted $r|_N$, as the regular
expression inductively defined as:

    $\varepsilon|_N = \varepsilon$
    $(r_1\ r_2)|_N = r_1|_N\ r_2|_N$
    $(r_1 \mid r_2)|_N = r_1|_N \mid r_2|_N$
    $(r^*)|_N = (r|_N)^*$
    $X|_N = X$              if $X \in N$
    $X|_N = \varepsilon$    if $X \notin N$        □

We generalize this notion to production rules of a grammar:

Definition 2.11 (Erasure of a rule) Let $X \to R$ be a production rule, and N a set of names.
We define the erasure of $X \to R$ with respect to N, noted $(X \to R)|_N$, as:

    $(X \to l[r])|_N = X \to l[r|_N]$
    $(X \to \mathit{String})|_N = X \to \mathit{String}$
    $(X \to \mathit{Any})|_N = X \to \mathit{Any}$        □

We recall that String and Any are special terminals denoting string and any content, re- 
spectively. We can finally define the erasure of a grammar: 

Definition 2.12 (Erasure of a tree grammar) Let $(\mathcal{S}, E)$ and $(\mathcal{S}', E')$ be tree grammars.
We say that $(\mathcal{S}', E')$ is an erasure of $(\mathcal{S}, E)$, noted $(\mathcal{S}', E') <: (\mathcal{S}, E)$, if and only if all
the following conditions hold:

1. $\mathcal{S}' \subseteq \mathcal{S}$;

2. if $X \to \mathit{String} \in E'$, then $X \to \mathit{String} \in E$;

3. if $X \to \mathit{Any} \in E'$, then $X \to \mathit{Any} \in E$;

4. for all rules $X \to l[r'] \in E'$, there exists a rule $X \to l[r] \in E$ such that $r' = r|_N$ for
some $N \subseteq \mathit{Names}(r)$. □

In summary, an erasure of a type grammar erases some rules and some non-terminals in the 
regular expressions. 
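
Definitions 2.10 and 2.11 translate directly into two short recursive functions. The sketch below uses a hypothetical tuple encoding of regular expressions (("eps",), ("seq", r1, r2), ("alt", r1, r2), ("star", r), ("name", X)) and of rules (a pair (lhs, rhs) where rhs is "String", "Any", or (label, regex)); the function names are likewise illustrative.

```python
EPS = ("eps",)

def N(x):
    return ("name", x)

def plus(r):  # r+ abbreviates r r*
    return ("seq", r, ("star", r))

def erase(r, keep):
    """r|_N: replace every non-terminal outside `keep` by the empty sequence."""
    tag = r[0]
    if tag == "eps":
        return r
    if tag == "name":
        return r if r[1] in keep else EPS
    if tag == "star":
        return ("star", erase(r[1], keep))
    return (tag, erase(r[1], keep), erase(r[2], keep))  # "seq" and "alt"

def erase_rule(rule, keep):
    """(X -> R)|_N: erase inside l[r]; String and Any rules are left unchanged."""
    lhs, rhs = rule
    if rhs in ("String", "Any"):
        return rule
    label, r = rhs
    return (lhs, (label, erase(r, keep)))

# Content model of the Book rule of Example 2.7: Title, (Author+ | Editor+), Publ
BOOK = ("Book", ("book", ("seq", ("seq", N("Title"),
                                  ("alt", plus(N("Author")), plus(N("Editor")))),
                          N("Publ"))))
```

Erasing the Book rule with respect to {Title, Author} keeps the shape of the content model and turns every occurrence of Editor and Publ into the empty sequence, so the erased rule accepts a title followed by a possibly empty list of authors.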

Finally, we conclude this section by recalling a few definitions taken from [29] that will
be useful for establishing further results.

Definition 2.13 (Competing non-terminals) Let $(\mathcal{S}, E)$ be a tree grammar. Let
$A, B \in \mathit{Names}(E)$ be two non-terminals such that $A \neq B$. A and B are competing if and only if
there exist $A \to l[r] \in E$ and $B \to l'[r'] \in E$ such that $l = l'$. □

The definitions that are actually interesting are those of local and single-type tree grammars,
which can be defined in terms of competing non-terminals:

Definition 2.14 (Local tree grammar) A regular tree grammar $(\mathcal{S}, E)$ is a local tree
grammar if and only if:

• $|\mathcal{S}| \leq 1$;

• E does not contain any competing non-terminals;

• for all $Y \in \mathit{Names}(E)$ there is exactly one rule in E whose left-hand side is Y. □

Definition 2.15 (Single-type tree grammar) A regular tree grammar $(\mathcal{S}, E)$ is a
single-type tree grammar if and only if:

1. for all $X \to l[r] \in E$, if A, B are in Names(r) and $A \neq B$, then A and B are not competing;
and

2. no pair of distinct non-terminals in $\mathcal{S}$ is competing. □

The interest of these two definitions is that, as shown in [29], they characterize the
structural constraints that can be expressed by the two most widespread schema formalisms,
namely DTDs (which roughly correspond to local tree grammars) and XML Schemas
(which are, essentially, single-type tree grammars).
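
The difference between the two classes can be checked mechanically. The sketch below is only an illustration under simplifying assumptions: a rule is a pair (lhs, rhs) where rhs is "String", "Any", or a pair (label, set of names occurring in the content model), so regular expressions are abstracted to their sets of names; the locality check ranges over all names defined or used in the rules; and the bound on the number of start symbols is taken to be at most one. The grammar `TWO_TITLES`, in which books and persons both have a title element but with different non-terminals, is a made-up example.

```python
from itertools import combinations

def labels_of(rules, nt):
    """Element labels produced by the rules whose left-hand side is nt."""
    return {rhs[0] for lhs, rhs in rules if lhs == nt and isinstance(rhs, tuple)}

def competing(rules, a, b):
    """Two distinct non-terminals compete if they can produce the same label."""
    return a != b and bool(labels_of(rules, a) & labels_of(rules, b))

def all_names(rules):
    used = set()
    for _, rhs in rules:
        if isinstance(rhs, tuple):
            used |= rhs[1]
    return used | {lhs for lhs, _ in rules}

def is_local(rules, starts):
    names = sorted(all_names(rules))
    return (len(starts) <= 1
            and not any(competing(rules, a, b) for a, b in combinations(names, 2))
            and all(sum(lhs == y for lhs, _ in rules) == 1 for y in names))

def is_single_type(rules, starts):
    for _, rhs in rules:
        if isinstance(rhs, tuple):
            if any(competing(rules, a, b) for a, b in combinations(sorted(rhs[1]), 2)):
                return False
    return not any(competing(rules, a, b) for a, b in combinations(sorted(starts), 2))

BIB = [
    ("X", ("bib", {"Book"})),
    ("Book", ("book", {"Title", "Author", "Editor", "Publ"})),
    ("Title", ("title", {"Str"})),
    ("Author", ("author", {"Str"})),
    ("Editor", ("editor", {"Str"})),
    ("Publ", ("publisher", {"Str"})),
    ("Str", "String"),
]

TWO_TITLES = [
    ("Doc", ("doc", {"Book", "Person"})),
    ("Book", ("book", {"BTitle"})),
    ("Person", ("person", {"PTitle"})),
    ("BTitle", ("title", {"Str"})),   # same label as PTitle, different context
    ("PTitle", ("title", {"Str"})),
    ("Str", "String"),
]
```

In `TWO_TITLES` the non-terminals BTitle and PTitle compete, so the grammar is not local, but they never occur in the same content model, so it is single-type.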

3 Type projectors



In this section we shall first precisely define what type projectors are and then establish 
some useful closure results on type projectors. 

3.1 Definition 

Definition 3.1 (Type Projector) Given a type $(\mathcal{S}, E)$, a (possibly empty) set of rules
$\pi \subseteq E$ is a type projector if and only if $(\mathcal{S} \cap \mathit{Names}(\pi), \pi)$ is
a regular tree grammar erasure of $(\mathcal{S}, E)$.

A type projector is thus a set of rules obtained from the type $(\mathcal{S}, E)$ by erasing some rules
and some non-terminals in the remaining rules.

A type projector for a given type describes a particular pruning for XML documents of 
that type, that is, a type driven projection: 

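Definition 3.2 (Type Driven Projections) Let $\pi$ be a type projector for $(\mathcal{S}, E)$ and t a
forest such that $t \in_{\mathcal{I}} (\mathcal{S}, E)$. The $\pi$-projection of t, noted as $t \setminus_{\mathcal{I}} \pi$, is defined as follows:

    $() \setminus_{\mathcal{I}} \pi = ()$
    $s_i \setminus_{\mathcal{I}} \pi = s_i$                             if $\mathcal{I}(i) \to \mathit{String} \in \pi$ or $\mathcal{I}(i) \to \mathit{Any} \in \pi$
    $s_i \setminus_{\mathcal{I}} \pi = ()$                              if $\mathcal{I}(i) \to \mathit{String} \notin \pi$ and $\mathcal{I}(i) \to \mathit{Any} \notin \pi$
    $l_i[f] \setminus_{\mathcal{I}} \pi = l_i[f]$                       if $\mathcal{I}(i) \to \mathit{Any} \in \pi$
    $l_i[f] \setminus_{\mathcal{I}} \pi = l_i[f \setminus_{\mathcal{I}} \pi]$    if $\mathcal{I}(i) \to l[r] \in \pi$ and $\mathcal{I}(i) \to \mathit{Any} \notin \pi$
    $l_i[f] \setminus_{\mathcal{I}} \pi = ()$                           if $\mathcal{I}(i) \to l[r] \notin \pi$ and $\mathcal{I}(i) \to \mathit{Any} \notin \pi$
    $(f, f') \setminus_{\mathcal{I}} \pi = (f \setminus_{\mathcal{I}} \pi), (f' \setminus_{\mathcal{I}} \pi)$        □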

In words, pruning erases (by replacing it by an empty forest) every node that cannot be
derived by a rule in $\pi$.
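
Definition 3.2 can be read as a recursive function over forests, as in the following sketch. The encodings are hypothetical: a forest is a list of trees, a text leaf is ("text", i, s), an element is ("elem", i, label, children), the interpretation is a dictionary from identifiers to names, and a projector is abstracted to a set of pairs (name, head) where head is "String", "Any", or ("elem", label), that is, the content models of the rules are not needed to perform the pruning itself. The helper name `Str` again stands for a non-terminal with rule Str → String.

```python
def project(forest, interp, pi):
    """Type-driven projection of a forest, following the cases of Definition 3.2."""
    out = []
    for t in forest:
        nt = interp[t[1]]
        if t[0] == "text":
            # A text leaf is kept iff its name has a String or Any rule in pi.
            if (nt, "String") in pi or (nt, "Any") in pi:
                out.append(t)
        elif (nt, "Any") in pi:
            out.append(t)  # the whole subtree is kept as it is
        elif (nt, ("elem", t[2])) in pi:
            out.append(("elem", t[1], t[2], project(t[3], interp, pi)))
        # otherwise the tree is replaced by the empty forest, i.e. dropped
    return out

# The tree shown after Definition 2.8, with identifiers assigned in document order.
DOC = [("elem", 0, "bib", [
    ("elem", 1, "book", [
        ("elem", 2, "title", [("text", 3, "Divina Commedia")]),
        ("elem", 4, "author", [("text", 5, "Dante")]),
        ("elem", 6, "publisher", [("text", 7, "Ludovico Dolce")]),
    ]),
])]
INTERP = {0: "X", 1: "Book", 2: "Title", 3: "Str",
          4: "Author", 5: "Str", 6: "Publ", 7: "Str"}

# A projector keeping only the rules for X, Book, Title and string content.
PI = {("X", ("elem", "bib")), ("Book", ("elem", "book")),
      ("Title", ("elem", "title")), ("Str", "String")}
```

With this projector the author and publisher subtrees are discarded, and the result is the tree bib[book[title["Divina Commedia"]]].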

Lemma 3.3 Let $\pi$ be a type projector for $(\mathcal{S}, E)$. Then for every tree $t \in_{\mathcal{I}} (\mathcal{S}, E)$ it holds

As the knowledgeable reader might have already noticed, validation (as in Definition 2.8)
and type-driven projection are quite similar. Given a tree t and a type $(\mathcal{S}, E)$, a validation
algorithm builds an interpretation $\mathcal{I}$ of t with respect to that type. More precisely, the
algorithm associates to each node of t a non-terminal of E. If it cannot find at least one,
validation fails and the tree is not valid with respect to $(\mathcal{S}, E)$. A type-driven projection
algorithm works exactly in the same way, but when a node cannot be associated with a
name it is simply discarded together with the associated subtree. Projecting a document
can thus be seen as an instance of validation. This observation is precious to determine the
complexity of type-driven projection, given a particular type projector $\pi$. If $\pi$ is a local tree
grammar or a single-type tree grammar (that is, a DTD or an XML Schema, see [29]) then
projection can be performed in a streaming fashion. On the contrary, if $\pi$ ends up being a
general tree grammar, then projection might require in the worst case keeping the whole tree
in memory (see our remark at the end of Section 3.2 for how to use type projection in this
particular setting).

3.2 Closure properties 

The fact that type-driven projection can be done efficiently when a type projector is a DTD
or an XML Schema is already a good thing. However, we can show a stronger result:
a type projector inherits the properties of the type it was deduced from. This is important
since in practice if someone chooses to use DTDs or XML Schemas to specify their
documents, the projection process should not be more expensive than the validation process.

Indeed, a nice property of the erasure of a type is that it preserves both the local tree 
and single type property. In other words, the erasure of a DTD remains a DTD and the 
erasure of an XML-Schema remains an XML-Schema. This is stated by the two following 
lemmas. 

Lemma 3.4 (Erasure preserves locality) Let $(\mathcal{S}, E)$ be a local tree grammar and $(\mathcal{S}', E')$
a regular tree grammar. If $(\mathcal{S}', E') <: (\mathcal{S}, E)$ then $(\mathcal{S}', E')$ is a local tree grammar.

Lemma 3.5 (Erasure preserves single-typedness) Let $(\mathcal{S}, E)$ be a single-type tree grammar
and $(\mathcal{S}', E')$ a regular tree grammar. If $(\mathcal{S}', E') <: (\mathcal{S}, E)$ then $(\mathcal{S}', E')$ is a
single-type tree grammar.

Last but not least, we show that if two projectors coming from the same type enjoy the 
local (resp. single-type) property, then their union is also local (resp. single-type). This 
property of type projectors is instrumental to our approach. Indeed, given a set of paths, 
we will compute a type projector for it by taking the union of all the type projectors of the 
individual paths. However, if taking the union of type projectors caused the loss of local or 
single-type properties, the interest of extending our approach to sets of paths (and thus to
XQuery or to sets of queries) would be quite limited.

The key observation here is that, while in general local and single-type tree gram- 
mars are not closed under union, two type-projectors that come from the same type share 
a common structure and therefore are not completely independent one from the other. In 
particular we can show that the union of two type projectors for the same type cannot in- 
troduce competing non-terminals in the resulting type projector. In terms of term-rewrite 
systems, we can say that the union of two type projectors does not introduce a critical pair 
(of non-terminals). 

Lemma 3.6 (Union closure of local type projectors) Let $(\mathcal{S}, E)$ be a local tree
grammar. Let $(\mathcal{S}_1, E_1)$ and $(\mathcal{S}_2, E_2)$ be two tree grammars such that
$(\mathcal{S}_1, E_1) <: (\mathcal{S}, E)$ and $(\mathcal{S}_2, E_2) <: (\mathcal{S}, E)$. Then
$(\mathcal{S}_1 \cup \mathcal{S}_2, E_1 \cup E_2)$ is a local tree grammar.

Lemma 3.7 (Union closure of single-type type projectors) Let $(\mathcal{S}, E)$ be a single-type
tree grammar. Let $(\mathcal{S}_1, E_1)$ and $(\mathcal{S}_2, E_2)$ be two tree grammars such that
$(\mathcal{S}_1, E_1) <: (\mathcal{S}, E)$ and $(\mathcal{S}_2, E_2) <: (\mathcal{S}, E)$. Then
$(\mathcal{S}_1 \cup \mathcal{S}_2, E_1 \cup E_2)$ is a single-type tree grammar.

To conclude this presentation of the formal properties of type projectors, we note
that a third category of deterministic regular tree grammars, namely restrained-competition
tree grammars (see [29]), is not closed under erasure. Therefore, for this kind of
schema (and the associated type projectors) pruning might require a full buffering of the input
document. However, this is only of mild importance since, to the best of our knowledge, no
well-known schema specification relies on it. All the other schema specifications that we
are aware of (XDuce and CDuce regular expression types, TREX, Relax Core, ...) possess
the full expressive power of regular tree languages which, as is well known, are closed
under erasure and union (see for instance [16]). This means that the type-driven projection
proposed here can be applied to these kinds of schemas as well. However, projection
remains as expensive as validation which, for these particular schemas, implies that the
whole document might need to be loaded into memory to actually decide which subtrees
must be pruned. Practical solutions to this problem are discussed in Section 10.

4 XPath$^\ell$

In XPath, queries are expressed by defining a path of steps separated by "/". For instance,

Q = /descendant::author/child::text[self::node = "Dante"]/parent::book/child::title

is the query that returns all titles of books whose author is "Dante". First, the navigational
part instructs to descend to all text nodes whose parent is an author (by following the path
/descendant::author/child::text), then the predicate selects those nodes that are the
string "Dante" (with the test self::node = "Dante"), and finally the navigation ascends to
the book element and descends to the title.

The inference rules we define in Section 5 do not work directly on queries such as Q.
The rules are defined for a subset of XPath that we dub XPath$^\ell$ and introduce in this
section. XPath$^\ell$ (for XPath light) includes forward and backward axes and a special kind of
predicates. In order to statically analyse Q (or any other XPath query that is not in XPath$^\ell$),
we will find an XPath$^\ell$ query that approximates Q soundly with respect to the inferred
pruning (Section 6), and use it to deduce the pruning for Q. Of course, these approximations,
as well as those we introduce later on, will only be used to determine the pruning: the
pruned document will be queried by the original query. We therefore proceed
as follows. In this section we define XPath$^\ell$, which is roughly equivalent to the structural
subset of positive XPath Core, without absolute paths. Then in Section 5 we introduce our
type and type-projector inference algorithms, which work on XPath$^\ell$ queries. To complete
the treatment of XPath we show in Section 6 how to compute a sound approximation of a
query Q with respect to type projection. In other words, given a (full) XPath query Q, we
will compute an XPath$^\ell$ query Q' such that the type projector inferred from Q' preserves
the semantics of Q.

Let us start by defining XPath$^\ell$ paths and their semantics. From now on, "path" refers
to an XPath$^\ell$ query as defined hereafter unless otherwise specified.

Definition 4.1 (XPath$^\ell$ path) An XPath$^\ell$ path is a term inductively generated by the
following grammar:

Path ::= Step | Path/Path | Path "|" Path

Step ::= Axis::Test | Axis::Test[Cond]

Axis ::= self | child | descendant | parent | ancestor

Test ::= tag | node | text

Cond ::= Cond or Cond | Cond and Cond | Path

where tag is a meta-variable ranging over element tags. □

As customary, "and" takes precedence over "or" and the path delimiter "/" takes precedence
over the top-level union "|". We will also use the (possibly indexed) meta-variables
P and C to range over paths and conditions, respectively.

The formal semantics of paths is inductively defined on the productions of Definition 4.1.
First, we formalise Test filtering as the set of nodes that satisfy a given test. Then
we formalise Axis selection as the set of nodes reachable from some context nodes by following
some Axis. Finally, we combine these notions to define the semantics of paths. The definitions
comply with the semantics of XPath 1.0 (see [35]).

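Definition 4.2 (Node test semantics) Given a tree $t$ and a set of nodes $S \subseteq \mathrm{Ids}(t)$ we
define:

\[
\begin{array}{rcl}
S ::_t l &=& S \cap \{\, i \in \mathrm{Ids}(t) \mid t@i = l_i[f] \,\} \\
S ::_t \mathsf{node} &=& S \\
S ::_t \mathsf{text} &=& S \cap \{\, i \in \mathrm{Ids}(t) \mid \exists s,\ t@i = s_i \,\}
\end{array}
\]
□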

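Definition 4.3 (Axes selection) Given a tree $t$ and a set of nodes $S \subseteq \mathrm{Ids}(t)$ (called context
nodes), we define $[\![Axis]\!]_t(S)$ as the set of nodes obtained by applying Axis to each node in
$S$:

\[
\begin{array}{rcl}
[\![\mathsf{self}]\!]_t(S) &=& S \\
[\![\mathsf{child}]\!]_t(S) &=& \bigcup_{i \in S} \{\, i' \mid (i,i') \in \mathrm{Edg}(t) \,\} \\
[\![\mathsf{parent}]\!]_t(S) &=& \bigcup_{i \in S} \{\, i' \mid (i',i) \in \mathrm{Edg}(t) \,\} \\
[\![\mathsf{descendant}]\!]_t(S) &=& \bigcup_{i \in S} \{\, i' \mid (i,i') \in \mathrm{Edg}(t)^+ \,\} \\
[\![\mathsf{ancestor}]\!]_t(S) &=& \bigcup_{i \in S} \{\, i' \mid (i',i) \in \mathrm{Edg}(t)^+ \,\}
\end{array}
\]

where $\mathrm{Edg}(t)$ is the edge relation of $t$, that is

\[
\mathrm{Edg}(t) = \{\, (i,i') \mid t@i = l_i[f, t', f'] \wedge \mathrm{RootId}(t') = i' \,\}
\]

and $\mathrm{Edg}(t)^+$ denotes its transitive closure. □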
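Definitions 4.2 and 4.3 translate almost literally into set comprehensions. The following Python sketch is only an illustration: the nested-tuple encoding of trees and the helper names `index_tree`, `closure`, `axis` and `node_test` are assumptions made for this example and are not part of the formal development.

```python
from itertools import count


def index_tree(tree):
    """Number the nodes of a nested tree in document order.

    An element is a pair (tag, [children]); a text node is a plain string.
    Returns (labels, edges): labels maps an identifier to ("elem", tag) or
    ("text", s), and edges plays the role of Edg(t).
    """
    labels, edges, fresh = {}, set(), count()

    def visit(node):
        i = next(fresh)
        if isinstance(node, str):
            labels[i] = ("text", node)
        else:
            tag, children = node
            labels[i] = ("elem", tag)
            for child in children:
                edges.add((i, visit(child)))
        return i

    visit(tree)
    return labels, edges


def closure(edges):
    """Transitive closure Edg(t)+ of the edge relation."""
    plus = set(edges)
    while True:
        new = {(a, d) for (a, b) in plus for (c, d) in edges if b == c} - plus
        if not new:
            return plus
        plus |= new


def axis(name, edges, S):
    """[[Axis]]_t(S): nodes reached from the context nodes S along an axis."""
    if name == "self":
        return set(S)
    rel = edges if name in ("child", "parent") else closure(edges)
    if name in ("child", "descendant"):
        return {j for (i, j) in rel if i in S}
    if name in ("parent", "ancestor"):
        return {i for (i, j) in rel if j in S}
    raise ValueError(f"unknown axis: {name}")


def node_test(name, labels, S):
    """S ::_t Test: keep the nodes of S that satisfy the test."""
    if name == "node":
        return set(S)
    if name == "text":
        return {i for i in S if labels[i][0] == "text"}
    return {i for i in S if labels[i] == ("elem", name)}


# a[ b[ d[] ], c[], b[ "x" ] ] gets the identifiers a=0, b=1, d=2, c=3, b=4, "x"=5.
LABELS, EDGES = index_tree(("a", [("b", [("d", [])]), ("c", []), ("b", ["x"])]))
```

On this sample tree, `axis("child", EDGES, {0})` selects the three children of the root and `node_test("b", LABELS, ...)` then keeps the two b elements.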

Since predicates may contain paths and conversely, path and predicate semantics are mutu- 
ally defined. 

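Definition 4.4 (XPath$^\ell$ semantics) Given $t$, a set $S \subseteq \mathrm{Ids}(t)$ and a path $P$, we define the
evaluation of path $P$ over the set of context nodes $S$ as the function $[\![P]\!]_t(S)$ defined as:

\[
\begin{array}{rcl}
[\![Axis{::}Test]\!]_t(S) &=& ([\![Axis]\!]_t(S)) ::_t Test \\
[\![Axis{::}Test[C]]\!]_t(S) &=& ([\![Axis]\!]_t(S)) ::_t Test \cap \{\, i \in S \mid \mathrm{Check}_t[C](i) \,\} \\
[\![Path_1/Path_2]\!]_t(S) &=& [\![Path_2]\!]_t([\![Path_1]\!]_t(S)) \\
[\![Path_1 \mid Path_2]\!]_t(S) &=& [\![Path_1]\!]_t(S) \cup [\![Path_2]\!]_t(S)
\end{array}
\]

where $\mathrm{Check}_{\_}[\_](\_)$ is the Boolean function defined as:

\[
\begin{array}{rcl}
\mathrm{Check}_t[Path](i) &=& [\![Path]\!]_t(\{i\}) \neq \emptyset \\
\mathrm{Check}_t[C_1 \ \mathsf{or}\ C_2](i) &=& \mathrm{Check}_t[C_1](i) \vee \mathrm{Check}_t[C_2](i) \\
\mathrm{Check}_t[C_1 \ \mathsf{and}\ C_2](i) &=& \mathrm{Check}_t[C_1](i) \wedge \mathrm{Check}_t[C_2](i)
\end{array}
\]
□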

It is easy to see that the last definition is well founded since terms are inductively generated
by the productions of the grammar in Definition 4.1.
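The mutual recursion between path evaluation and predicate checking can be sketched in a few lines of Python. This is a hedged illustration only: the tuple encoding of paths, the helper names, and the choice of attaching predicates only to self::node steps (written `("filter", C)` below, so that child::b[child::d] is expressed as child::b/self::node[child::d]) are assumptions of the sketch.

```python
from itertools import count


def index_tree(tree):
    """Elements are (tag, [children]), text nodes are strings; ids follow document order."""
    labels, edges, fresh = {}, set(), count()

    def visit(node):
        i = next(fresh)
        if isinstance(node, str):
            labels[i] = ("text", node)
        else:
            labels[i] = ("elem", node[0])
            for child in node[1]:
                edges.add((i, visit(child)))
        return i

    visit(tree)
    return labels, edges


def reach(edges, S, forward):
    """Nodes reachable from S in one or more steps along (or against) the edges."""
    seen, frontier = set(), set(S)
    while frontier:
        frontier = {(j if forward else i) for (i, j) in edges
                    if (i if forward else j) in frontier} - seen
        seen |= frontier
    return seen


def select(axis, test, labels, edges, S):
    """([[Axis]]_t(S)) ::_t Test."""
    if axis == "self":
        nodes = set(S)
    elif axis == "child":
        nodes = {j for (i, j) in edges if i in S}
    elif axis == "parent":
        nodes = {i for (i, j) in edges if j in S}
    else:  # descendant or ancestor
        nodes = reach(edges, S, forward=(axis == "descendant"))
    if test == "node":
        return nodes
    if test == "text":
        return {i for i in nodes if labels[i][0] == "text"}
    return {i for i in nodes if labels[i] == ("elem", test)}


def evaluate(path, labels, edges, S):
    """[[P]]_t(S) for ("step", axis, test), ("filter", C), ("seq", P1, P2), ("union", P1, P2)."""
    kind = path[0]
    if kind == "step":
        return select(path[1], path[2], labels, edges, S)
    if kind == "filter":  # self::node[C]
        return {i for i in S if check(path[1], labels, edges, i)}
    if kind == "seq":
        return evaluate(path[2], labels, edges, evaluate(path[1], labels, edges, S))
    if kind == "union":
        return evaluate(path[1], labels, edges, S) | evaluate(path[2], labels, edges, S)
    raise ValueError(f"unknown path constructor: {kind}")


def check(cond, labels, edges, i):
    """Check_t[C](i) for ("or", C1, C2), ("and", C1, C2) and ("path", P)."""
    kind = cond[0]
    if kind == "or":
        return check(cond[1], labels, edges, i) or check(cond[2], labels, edges, i)
    if kind == "and":
        return check(cond[1], labels, edges, i) and check(cond[2], labels, edges, i)
    return bool(evaluate(cond[1], labels, edges, {i}))


# a[ b[ d[] ], c[], b[ "x" ] ] with identifiers a=0, b=1, d=2, c=3, b=4, "x"=5.
LABELS, EDGES = index_tree(("a", [("b", [("d", [])]), ("c", []), ("b", ["x"])]))
B_WITH_D = ("seq", ("step", "child", "b"), ("filter", ("path", ("step", "child", "d"))))
```

Evaluating `B_WITH_D` from the root keeps only the first b element, since the second one has no d child.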

Although the paths in XPath$^\ell$ are quite simple, the definition of their static analysis can
become quite complex: the simultaneous presence in a single step of axes, tests, and predicates
can cause a case explosion in the definition of the analysis. This is not a problem for
a static analyzer, but it is a problem for a human reader. Fortunately for the human reader,
XPath$^\ell$ paths can be further simplified and transformed into equivalent normal forms in
which all non-trivial axes, tests and predicates are distributed over different steps. The idea
is then to normalize paths before passing them to the static analyzer so that the definition
of the latter becomes much simpler. The normal forms that will be analyzed by the static
analysis of Section 5 are defined as follows.

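Definition 4.5 (Single step normal form) Let $P$ be an XPath$^\ell$ query. The single step normal
form of $P$, noted $\mathrm{Snf}(P)$, is defined as:

\[
\begin{array}{rcll}
\mathrm{Snf}(Axis{::}\mathsf{node}) &=& Axis{::}\mathsf{node} & \\
\mathrm{Snf}(\mathsf{self}{::}Test) &=& \mathsf{self}{::}Test & \\
\mathrm{Snf}(\mathsf{self}{::}\mathsf{node}[C]) &=& \mathsf{self}{::}\mathsf{node}[\mathrm{Dnf}(C)] & \\
\mathrm{Snf}(Axis{::}Test) &=& Axis{::}\mathsf{node}/\mathsf{self}{::}Test & (\text{if } Axis \neq \mathsf{self} \wedge Test \neq \mathsf{node}) \\
\mathrm{Snf}(Axis{::}Test[C]) &=& Axis{::}\mathsf{node}/\mathsf{self}{::}Test/\mathsf{self}{::}\mathsf{node}[\mathrm{Dnf}(C)] & (\text{if } Axis \neq \mathsf{self} \wedge Test \neq \mathsf{node}) \\
\mathrm{Snf}(P_1/P_2) &=& \mathrm{Snf}(P_1)/\mathrm{Snf}(P_2) &
\end{array}
\]

where $\mathrm{Dnf}(C)$ is a disjunctive normal form of the Boolean proposition $C$ (whose atoms are
paths). □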
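As an illustration, the normalization of Definition 4.5 can be sketched as a purely syntactic transformation. The tuple encoding and the function names are assumptions of this example. The sketch also makes three choices that the definition does not spell out: it normalizes the paths occurring inside predicates, it treats the two uncovered combinations (Axis::node[C] with a non-self axis, and self::Test[C]) in the natural way, and it leaves the Boolean structure of predicates untouched instead of computing Dnf(C).

```python
def step(axis, test, cond=None):
    """The step Axis::Test or Axis::Test[cond]."""
    return ("step", axis, test, cond)


def snf(path):
    """Single step normal form, returned as the flat list of steps of P1/.../Pn."""
    if path[0] == "seq":
        return snf(path[1]) + snf(path[2])
    _, axis, test, cond = path
    steps = []
    # The axis is kept on a node() test; self::node is emitted only when it is
    # the whole step.
    if axis != "self" or (test == "node" and cond is None):
        steps.append(step(axis, "node"))
    # A non-trivial test becomes a separate self step.
    if test != "node":
        steps.append(step("self", test))
    # The predicate becomes a final self::node[...] step.
    if cond is not None:
        steps.append(step("self", "node", snf_cond(cond)))
    return steps


def snf_cond(cond):
    """Normalize the paths that occur as atoms of a predicate."""
    if cond[0] in ("or", "and"):
        return (cond[0], snf_cond(cond[1]), snf_cond(cond[2]))
    return ("path", snf(cond[1]))


# child::b[child::d]
EXAMPLE = step("child", "b", ("path", step("child", "d")))
```

Here `snf(EXAMPLE)` yields the three steps child::node, self::b and self::node[child::node/self::d].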

It is clear from this definition that P and Snf(P) have the same semantics. Indeed, if we
have a step

Axis::Test[C]

then its single step normal form

Axis::node/self::Test/self::node[Dnf(C)]

only makes the order of node selection more explicit (see footnote 1). For a given set of context nodes
S, we first select all nodes that can be reached by the Axis. Then we keep only nodes that
match the Test. Finally we further refine the result by filtering the nodes that satisfy the
predicate C, put in disjunctive normal form. The disjunctive normal form of our predicates is
obtained by distributing the "and" over the "or", yielding a formula of the form $\mathrm{Or}_j\,\mathrm{And}_k\,P_{jk}$
(where the $P_{jk}$ are paths). Although this may yield an exponential blow-up of the formula,
remember that we introduce this simplification only to provide a concise and human-readable
presentation of the static type inference algorithms. An actual implementation can work
directly on the abstract syntax tree of the formula without resorting to this transformation.
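The distribution step and the blow-up it may cause can be made concrete with a short sketch. The encoding is an assumption of this example: a predicate is either an atom (a string standing for a path) or a triple `("or", C1, C2)` / `("and", C1, C2)`, and the result is a list of conjuncts, each being a list of atoms.

```python
from functools import reduce


def dnf(cond):
    """Disjunctive normal form as a list of conjuncts (lists of atoms)."""
    if isinstance(cond, str):
        return [[cond]]
    op, left, right = cond
    if op == "or":
        # A disjunction simply concatenates the conjuncts of both sides.
        return dnf(left) + dnf(right)
    if op == "and":
        # A conjunction pairs every conjunct of the left with every one of the right.
        return [a + b for a in dnf(left) for b in dnf(right)]
    raise ValueError(f"unknown connective: {op}")


# (p1 or p2) and p3
SMALL = ("and", ("or", "p1", "p2"), "p3")

# (a0 or b0) and (a1 or b1) and (a2 or b2) and (a3 or b3)
WIDE = reduce(lambda c, i: ("and", c, ("or", f"a{i}", f"b{i}")),
              range(1, 4), ("or", "a0", "b0"))
```

A conjunction of n binary disjunctions, such as `WIDE`, produces 2^n conjuncts, which is the exponential growth mentioned above.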

5 Static Analysis 

In this section we define deduction rules to statically infer, from an XPath$^\ell$ path $P$ and a type
$(\mathcal{S}, E)$, a type projector for any input document valid with respect to $(\mathcal{S}, E)$. We show that
the analysis is sound, and that it enjoys completeness for a large class of queries when $E$ is
a *-guarded and non-recursive local tree grammar (see Definition 5.7 later on). Soundness
means that executing the query on the original document and on the document pruned
by the inferred projector yields the same result. Completeness means that the analysis
infers the best correct projector, that is, that if we take a type projector smaller (i.e., more
selective) than the inferred one, then there exists a document validating $(\mathcal{S}, E)$ for which
the result of the two executions is not the same. When the conditions on schemas or on
queries are relaxed, the analysis is still sound but it may not be complete. Nevertheless,
as we will formally illustrate, it is still very precise.

In order to define our static type-projector inference algorithm we proceed in two steps.

1. Given a path $P$ and a regular expression grammar $(\mathcal{S}, E)$, the rough idea is to use a
type system to associate $P$ with the set of all trees that may appear in the result of
applying $P$ to a document validating $(\mathcal{S}, E)$. In order to achieve great precision,
we then "type" $P$ by the set of all rules of $E$ that validate any tree in the result (see footnote 2). This
is done in Section 5.1.

2. Next, we use the type system defined in the previous point to define inference of type
projectors. In particular we use the cases in which the previous type system returns
an empty set of rules to determine the points in which pruning must be performed.
This is done in Section 5.2.

5.1 Type inference 

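Given a path $Path$ and a schema $(\mathcal{S}, E)$ we want to find a subset of rules in $E$ that can
generate all the trees that can occur in the result of $Path$ when applied to a tree validating
$(\mathcal{S}, E)$. Formally, we want to infer a set $\tau \subseteq E$ such that

\[
\forall t \in_{\mathfrak{I}} (\mathcal{S}, E), \quad \mathfrak{I}([\![Path]\!]_t(\mathrm{RootId}(t))) \subseteq \mathrm{Dn}(\tau) \qquad (1)
\]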

The equation above states the soundness of the analysis. In words, it says that if we take
any tree $t$ valid for $(\mathcal{S}, E)$ and we apply the path $Path$ to it, then the type $\tau$ inferred in the
type system defines every symbol interpreting a node in the result. As usual, soundness
alone is not interesting since there always are sets that trivially satisfy it (notably, the set
of all rules in $E$). What we aim at is an analysis that is as selective as possible, that is, an
analysis that is precise enough to guarantee, on a large class of types and for a large class of
queries, that whenever the path semantics is empty over all possible instances of the input
type, then the inferred type $\tau$ is empty as well:

\[
\forall t \in_{\mathfrak{I}} (\mathcal{S}, E), \quad \mathfrak{I}([\![Path]\!]_t(\mathrm{RootId}(t))) = \emptyset \;\Rightarrow\; \tau = \emptyset \qquad (2)
\]

(the converse is a consequence of (1)). In other terms, we want that if there does not exist
any instance of the type that matches the path, then the path is typed by the empty set.

The precision described by (2) will then be used during the inference of type projectors
to discard elements that are useless in the evaluation of $Path$, that is, all the sub-trees of the
original document that cannot be matched by $Path$.

1 As an aside, note that this kind of equivalence does not hold for full XPath because of the position()
function. Indeed, descendant::a[position()=1] and descendant::node/self::a[position()=1]
do not return, in general, the same result. The former returns the first "a"-node in pre-order while the latter returns
all the "a"-nodes of the document.

2 This yields a finer-grained analysis since different rules may generate the same tree but in different contexts.

We start by inferring types for single-step paths without predicates. 

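Definition 5.1 (Unconditional Single Step Typing) The type of an unconditional single-step
query $Axis{::}Test$ for the schema $(\mathcal{S}, E)$ is given by:

\[
T_E(A_E(\mathcal{S}, Axis), Test)
\]

where axes are typed as:

\[
\begin{array}{rcl}
A_E(\tau, \mathsf{self}) &=& \tau \\
A_E(\tau, \mathsf{child}) &=& \bigcup_{Y \in \mathrm{Dn}(\tau)} \{\, Z \to R \in E \mid Y \Rightarrow_E Z \,\} \\
A_E(\tau, \mathsf{parent}) &=& \bigcup_{Y \in \mathrm{Dn}(\tau)} \{\, Z \to R \in E \mid Z \Rightarrow_E Y \,\} \\
A_E(\tau, \mathsf{descendant}) &=& \bigcup_{Y \in \mathrm{Dn}(\tau)} \{\, Z \to R \in E \mid Y \Rightarrow_E^+ Z \,\} \\
A_E(\tau, \mathsf{ancestor}) &=& \bigcup_{Y \in \mathrm{Dn}(\tau)} \{\, Z \to R \in E \mid Z \Rightarrow_E^+ Y \,\}
\end{array}
\]

and tests are typed as:

\[
\begin{array}{rcl}
T_E(\tau, \mathsf{node}) &=& \tau \\
T_E(\tau, a) &=& \{\, Y \to R \mid Y \in \mathrm{Dn}(\tau),\ R = a[R'] \text{ or } R = \mathsf{Any} \,\} \\
T_E(\tau, \mathsf{text}) &=& \{\, Y \to R \mid Y \in \mathrm{Dn}(\tau),\ R = \mathsf{String} \text{ or } R = \mathsf{Any} \,\}
\end{array}
\]
□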



This definition introduces two typing operators, one for axes, $A_{\_}(\_,\_)$, and one for
tests, $T_{\_}(\_,\_)$. Firstly, $A_E(\tau, Axis)$ returns all the rules that can be reached from names in
$\tau$ following $Axis$. If $Axis$ is self, child or descendant, our definition coincides with the
static semantics of XQuery and XPath, as defined by Draper et al. in [18]. However, Draper
et al.'s static semantics is much less precise than ours in the case of backward axes. Translated
into our formalism, the type of parent and ancestor for any $\tau$ would be $\{X \to \mathsf{Any}\}$ for
some name $X$ (see footnote 3).

Secondly, $T_E(\tau, Test)$ restricts the rules in $\tau$ to those that type elements compatible
with $Test$.

The soundness of this definition, that is, the property stated by Formula (1), is given by
the following lemma.

Lemma 5.2 Let $t$ be a tree $\mathfrak{I}$-valid with respect to the schema $(\mathcal{S}, E)$. For every $S \subseteq \mathrm{Ids}(t)$
and type $\tau$, if $\mathfrak{I}(S) \subseteq \mathrm{Dn}(\tau)$, then

1. $\mathfrak{I}([\![Axis]\!]_t(S)) \subseteq \mathrm{Dn}(A_E(\tau, Axis))$

2. $\mathfrak{I}(S ::_t Test) \subseteq \mathrm{Dn}(T_E(\tau, Test))$

It is easy to check that the property stated by Formula (1) is a direct consequence of
Definition 4.4 and the composition of the two properties of the lemma above.

3 More precisely, parent::test and ancestor::test return the union type element() | document()
independently of test, where element() is the type of any element node and document() is the type of the document
node, which we do not consider in our data model.

The presence of upward axes makes the typing of composed paths much more difficult.
To ensure precision, that is, the property stated by Formula (2), we have to be careful in
dealing with types in which an element may occur in the content of different elements.
The naive solution, which consists of inferring a type for composed paths by composing the
functions we just defined for single steps, works only in the absence of upward axes. This
can be illustrated by an example. Consider the following grammar rooted at X:

X → a[Y],  X → b[Z],  Y → c[],  Z → d[]

and observe that X yields two possible definitions. Now consider the path

self::a/child::c/parent::node

applied to documents of the above type. The precise type that this path should have
is {X → a[Y]}. However, if we naively iterate Definition 5.1 we obtain at the first step
{X → a[Y]}, onto which we apply child::c, which yields {Y → c[]}, to which we finally
apply parent::node, which gives us {X → a[Y], X → b[Z]}, which is sound but imprecise.
This is due to the fact that the single step typing blindly selects all rules associated with a
name which can generate Y, here all the rules associated with X.
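The imprecision of the naive composition can be replayed with a small sketch of the two operators of Definition 5.1. The encoding is an assumption made for this illustration: a rule is a triple (name, tag, content) whose content is abstracted to the set of names occurring in its regular expression, String rules carry the tag `None`, and Any is left out. The function names are hypothetical.

```python
def rule(name, tag, *content):
    """The rule name -> tag[content], keeping only the names used in the content."""
    return (name, tag, frozenset(content))


def closure(rel):
    """Transitive closure of a binary relation given as a set of pairs."""
    plus = set(rel)
    while True:
        new = {(a, d) for (a, b) in plus for (c, d) in rel if b == c} - plus
        if not new:
            return plus
        plus |= new


def type_axis(E, tau, axis):
    """A_E(tau, Axis): all rules of E whose name is reached from Dn(tau) along the axis."""
    if axis == "self":
        return set(tau)
    derives = {(y, z) for (y, _, content) in E for z in content}  # Y =>_E Z
    rel = derives if axis in ("child", "parent") else closure(derives)
    names = {name for (name, _, _) in tau}  # Dn(tau)
    if axis in ("child", "descendant"):
        targets = {z for (y, z) in rel if y in names}
    else:  # parent or ancestor
        targets = {y for (y, z) in rel if z in names}
    return {r for r in E if r[0] in targets}


def type_test(tau, test):
    """T_E(tau, Test): rules of tau that type nodes compatible with the test."""
    if test == "node":
        return set(tau)
    if test == "text":
        return {r for r in tau if r[1] is None}
    return {r for r in tau if r[1] == test}


def naive_type(E, tau, steps):
    """Compose single-step typing without any context (sound but imprecise)."""
    for axis, test in steps:
        tau = type_test(type_axis(E, tau, axis), test)
    return tau


XA, XB = rule("X", "a", "Y"), rule("X", "b", "Z")
YC, ZD = rule("Y", "c"), rule("Z", "d")
GRAMMAR = {XA, XB, YC, ZD}
# self::a/child::c/parent::node, starting from the rules of the root name X
RESULT = naive_type(GRAMMAR, {XA, XB}, [("self", "a"), ("child", "c"), ("parent", "node")])
```

With this encoding `RESULT` contains both rules of X, although only X → a[Y] can actually type a node returned by the path.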

To solve this problem we introduce particular sets of rules, called contexts, which are
updated at each step and contain the rules already encountered in previous steps. We then use
them to refine type inference for upward axes. In the previous example, when typing the
first two steps we build a context

{X → a[Y], Y → c[]}

indicating that for the moment the two rules are the only ones visited by the traversal.
Then, we use Definition 5.1 to type parent::node, thus obtaining {X → a[Y], X → b[Z]},
as before, but this time we intersect it with the context, thus obtaining the precise answer
{X → a[Y]}. We now formalize this idea:

Definition 5.3 (Type inference) Let $(\mathcal{S}, E)$ be a type and $P$ an XPath$^\ell$ query in single step
normal form. Let $\tau$ and $\kappa$ be two sets of rules of $E$. If the deduction system in Figure 1
deduces for a path $P$ the judgment

\[
(\tau, \kappa) \vdash_E P : (\tau', \kappa')
\]

then we say that $P$ has type $(\mathrm{Dn}(\tau'), \kappa')$. □

The idea underlying the judgments of the definition is that if the system proves
$(\tau, \kappa) \vdash_E P : (\tau', \kappa')$, then from an input set of rules $\tau$ and an input context $\kappa$ the application of $P$ returns
an output set of rules $\tau'$ and an updated context $\kappa'$. In other terms, $\tau$ is (the production
part of) a type that approximates the current nodes, $\kappa$ is the context that was visited to type
them, $\tau'$ is (the production part of) a type that approximates the set of nodes reachable from
the current ones by following $P$, and $\kappa'$ is the additional context visited to reach them. In
Figure 1, environments (that is, pairs of sets of rules) are ranged over by $\Sigma$ for concision.
$\Sigma$ being a pair $(\tau, \kappa)$, we use $\Sigma_{typ}$ to denote its first projection (i.e., the "type" component
$\tau$) and $\Sigma_{ctx}$ to denote the second projection (i.e., the "context" component $\kappa$).

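Definition 5.4 (Environment well-formedness) Let $(\tau, \kappa)$ be an environment and $E$ a set
of rules. If $\tau \subseteq E$ and $\kappa \subseteq \tau \cup A_E(\tau, \mathsf{ancestor})$, then we say that $(\tau, \kappa)$ is well formed
with respect to $E$. □

In other words, a context is well-formed if it contains only rules from which the names in
$\mathrm{Dn}(\tau)$ are reachable. We say that a judgment $\Sigma \vdash_E P : \Sigma'$ is well formed if both $\Sigma$ and $\Sigma'$
are well formed with respect to $E$. We can remark that the rules in Figure 1 are syntax
directed (at most one rule applies for a given judgment) and that they preserve context
well-formedness.

Primitive Single Step

\[
\textit{(down-axis)}\quad
\dfrac{Axis \in \{\mathsf{self}, \mathsf{child}, \mathsf{descendant}\}}
      {\Sigma \vdash_E Axis{::}\mathsf{node} : (A_E(\Sigma_{typ}, Axis),\ \Sigma_{ctx} \cup A_E(\Sigma_{typ}, Axis))}
\]

\[
\textit{(up-axis)}\quad
\dfrac{Axis \in \{\mathsf{parent}, \mathsf{ancestor}\}}
      {\Sigma \vdash_E Axis{::}\mathsf{node} : (A_E(\Sigma_{typ}, Axis) \cap \Sigma_{ctx},\ A_E(\Sigma_{ctx}, Axis) \cap \Sigma_{ctx})}
\]

\[
\textit{(test)}\ (*)\quad
\dfrac{Test \neq \mathsf{node}}
      {\Sigma \vdash_E \mathsf{self}{::}Test : (\tau,\ (\Sigma_{ctx} \cap A_E(\tau, \mathsf{ancestor})) \cup \tau)}
\]

(*) where $\tau = T_E(\Sigma_{typ}, Test)$

\[
\textit{(predicate)}\ (**)\quad
\Sigma \vdash_E \mathsf{self}{::}\mathsf{node}[\mathrm{Or}_j\,\mathrm{And}_k\,P_{jk}] : (\tau,\ (\Sigma_{ctx} \cap A_E(\tau, \mathsf{ancestor})) \cup \tau)
\]

(**) where $\tau = \{\, X_i \to R_i \mid \bigcup_j \bigcap_k \Sigma^{ijk}_{typ} \neq \emptyset \,\}$

Composed paths

\[
\textit{(sequence)}\quad
\dfrac{\Sigma \vdash_E Step : \Sigma'' \qquad \Sigma'' \vdash_E Path : \Sigma'}
      {\Sigma \vdash_E Step/Path : \Sigma'}
\]

\[
\textit{(union)}\quad
\dfrac{\Sigma \vdash_E Path_1 : (\tau_1, \kappa_1) \qquad \Sigma \vdash_E Path_2 : (\tau_2, \kappa_2)}
      {\Sigma \vdash_E Path_1 \mid Path_2 : (\tau_1 \cup \tau_2,\ \kappa_1 \cup \kappa_2)}
\]

Figure 1: Inference rules for XPath$^\ell$ queries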

The rules are relatively simple to understand. The first two rules implement our main
idea: when we follow an axis $Axis$, we compute the type by $A_E(\Sigma_{typ}, Axis)$; if the axis is
a downward one, then we add this type to the current context, otherwise, if the axis is an
upward one, we intersect it with the current context (both for the type part and for
the context part). The rule for (test) is slightly more difficult since it discards from the
current set of rules those that do not satisfy the test: the type is computed by $T_E(\Sigma_{typ}, Test)$,
while the context is obtained by removing all the rules that were in there just because they
generated one of the discarded nodes; to do so it generates (the type of) all ancestors of
the nodes satisfying the test, and intersects them with the current context. The fourth rule,
(predicate), is the most difficult one. Recall that we work with single step normal forms
and, therefore, that the predicates are Boolean formulas over paths in disjunctive normal
form; the type $\tau$ is obtained by discarding from $\Sigma_{typ}$ all rules for which the predicate never
holds; thus for each $X_i \to R_i$ in $\Sigma_{typ}$ we compute the type of all the paths $P_{jk}$ in the predicate,
and keep in $\tau$ only rules for which at least one path may yield a non-empty result; the
context is then computed as in the deduction rule (test), by discarding from the context all
rules that generated only rules discarded from $\Sigma_{typ}$. The deduction rule (sequence) chains
the result of one step to the following one. Lastly, the rule (union) handles the top-level
union operator "|".

Let us illustrate how the algorithm works on an example. Consider the grammar with
rules

{A → a[B|C|E], B → b[D], C → c[], D → d[], E → b[]}

and rooted in A, and the path

child::node/self::b/self::node[child::node/self::d]

Notice that the path above is nothing but the single step normal form of

child::b[child::d]

We start from an initial environment

$\Sigma$ = ({A → a[B|C|E]}, {A → a[B|C|E]})

in which both the context and the type component contain all the rules whose left-hand side
is a root of the grammar (in this case we have just one rule). The first step is typed with the
(down-axis) rule, giving the result $\Sigma^1$ where

$\Sigma^1_{typ}$ = {B → b[D], C → c[], E → b[]}

and

$\Sigma^1_{ctx}$ = {A → a[B|C|E], B → b[D], C → c[], E → b[]}

The second step is typed by applying the rule (test), which returns $\Sigma^2$:

$\Sigma^2_{typ}$ = {B → b[D], E → b[]}

and, more interestingly, the context

$\Sigma^2_{ctx}$ = {A → a[B|C|E], B → b[D], E → b[]}

Indeed, the intersection of $\Sigma^1_{ctx}$ with the rules generated by the ancestors of B, namely A,
yields exactly {A → a[B|C|E]}, to which we add the result of the current step:

{B → b[D], E → b[]}

As we said, this intersection ensures that we only keep in the context rules from which we
can derive the current type. In this example, the rule for C, which was introduced by the
wildcard step child::node, is removed by the typing of the more restrictive step self::b.
The third step is typed by the (predicate) rule. Intuitively, this rule types independently the
path child::node/self::d and keeps in the result only the input rules for which the path
yields a non-empty result which, in this case, is the rule for B:

$\Sigma^3_{typ}$ = {B → b[D]}

As before, the context is purged of rules that do not generate the current type:

$\Sigma^3_{ctx}$ = {A → a[B|C|E], B → b[D]}
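The example above can be replayed mechanically. The following Python sketch is an illustration under several assumptions: rules are triples (name, tag, content) whose content is abstracted to the set of names it mentions, paths are lists of single steps already in normal form, the top-level union is omitted, and the (predicate) rule is read as keeping a rule when, for at least one conjunct of the predicate, every path of that conjunct has a non-empty type. All function names are hypothetical.

```python
def rule(name, tag, *content):
    """The rule name -> tag[content], keeping only the names used in the content."""
    return (name, tag, frozenset(content))


def closure(rel):
    """Transitive closure of a binary relation given as a set of pairs."""
    plus = set(rel)
    while True:
        new = {(a, d) for (a, b) in plus for (c, d) in rel if b == c} - plus
        if not new:
            return plus
        plus |= new


def type_axis(E, tau, axis):
    """A_E(tau, Axis) of Definition 5.1."""
    if axis == "self":
        return set(tau)
    derives = {(y, z) for (y, _, content) in E for z in content}
    rel = derives if axis in ("child", "parent") else closure(derives)
    names = {name for (name, _, _) in tau}
    if axis in ("child", "descendant"):
        targets = {z for (y, z) in rel if y in names}
    else:
        targets = {y for (y, z) in rel if z in names}
    return {r for r in E if r[0] in targets}


def type_test(tau, test):
    """T_E(tau, Test) of Definition 5.1 (tag None stands for String)."""
    if test == "node":
        return set(tau)
    return {r for r in tau if r[1] == (None if test == "text" else test)}


def infer(E, env, path):
    """Thread an environment (type, context) through the steps of a path."""
    typ, ctx = env
    for kind, arg in path:
        if kind == "axis" and arg in ("self", "child", "descendant"):  # (down-axis)
            typ = type_axis(E, typ, arg)
            ctx = ctx | typ
        elif kind == "axis":                                           # (up-axis)
            typ, ctx = type_axis(E, typ, arg) & ctx, type_axis(E, ctx, arg) & ctx
        elif kind == "test":                                           # (test)
            typ = type_test(typ, arg)
            ctx = (ctx & type_axis(E, typ, "ancestor")) | typ
        else:                                                          # (predicate)
            typ = {r for r in typ
                   if any(all(infer(E, ({r}, ctx), p)[0] for p in conj) for conj in arg)}
            ctx = (ctx & type_axis(E, typ, "ancestor")) | typ
    return typ, ctx


A, B, C = rule("A", "a", "B", "C", "E"), rule("B", "b", "D"), rule("C", "c")
D, E5 = rule("D", "d"), rule("E", "b")
GRAMMAR = {A, B, C, D, E5}
# child::node/self::b/self::node[child::node/self::d]
PATH = [("axis", "child"), ("test", "b"),
        ("pred", [[[("axis", "child"), ("test", "d")]]])]
FINAL = infer(GRAMMAR, ({A}, {A}), PATH)
```

Under these assumptions `FINAL` is the pair made of the rule for B and of the context containing the rules for A and B, as in the environment obtained after the third step above.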

Before proving the main theorems of type inference, namely soundness and completeness,
let us first show that the inference rules of Figure 1 indeed form an algorithm.

Lemma 5.5 (Termination of type inference) Let $(\mathcal{S}, E)$ be a type, $P$ a path, and $\Sigma$ and $\Sigma'$
two environments. If there is a derivation for the judgment $\Sigma \vdash_E P : \Sigma'$, then this derivation
is unique and finite.
We can now proceed to prove the soundness of the type system.

Theorem 5.6 (Soundness of type inference) Let $(\mathcal{S}, E)$ be a type and $P$ a path. Let
$E_0 = \{X \to R \mid X \to R \in E, X \in \mathcal{S}\}$. If $(E_0, E_0) \vdash_E P : (\tau, \kappa)$ then:

\[
\mathrm{Dn}(\tau) \supseteq \bigcup_{t \in_{\mathfrak{I}} (\mathcal{S}, E)} \mathfrak{I}([\![P]\!]_t(\mathrm{RootId}(t)))
\]

The type system is sound. It is also complete for a particular class of schemas, namely local
tree grammars that are *-guarded, non-recursive, and parent-unambiguous. Intuitively, a
type is *-guarded when every union occurring in its productions is guarded by * (or by +),
it is non-recursive if the depth of all documents validating it is bounded, and it is
parent-unambiguous if no rule types both the parent and a strict ancestor of the parent of another
name. Formally, we have the following definition:

Definition 5.7 Let $(\mathcal{S}, E)$ be a local tree grammar.

1. $E$ is *-guarded if for each $Y \to l[r]$ in $E$, the regular expression is a product
$r = r_1 \cdot \ldots \cdot r_n$ and whenever $r_i$ contains a union, then $r_i = (r')^*$;

2. $E$ is non-recursive if it is never the case that $Y \Rightarrow_E^+ Y$, for any name $Y \in \mathrm{Names}(E)$;

3. $E$ is parent-unambiguous if for all chains $c$, $c'$ and names $Y, Z$ such that
$cYZ \in \mathrm{Chains}_{(\mathcal{S},E)}(X)$ the implication

\[
cYc'Z \in \mathrm{Chains}_{(\mathcal{S},E)}(X) \;\Rightarrow\; c' = \epsilon
\]

holds ($\epsilon$ denotes the empty chain). □
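Two of these conditions only depend on which names occur in the content of which rule, so they can be checked on a simple graph of names. The sketch below is an illustration under stated assumptions: a local tree grammar is encoded as a dictionary from each name to the set of names used in its (single) rule, chains are read as paths in this graph starting from a root name, and the function names are hypothetical. The *-guarded condition is not covered because it needs the structure of the regular expressions.

```python
def reachable(graph, start):
    """Names reachable from the names in start in one or more derivation steps."""
    seen, frontier = set(), set(start)
    while frontier:
        frontier = {z for y in frontier for z in graph.get(y, ())} - seen
        seen |= frontier
    return seen


def is_non_recursive(graph):
    """No name derives itself in one or more steps."""
    return all(y not in reachable(graph, {y}) for y in graph)


def is_parent_unambiguous(graph, roots):
    """No name is both a child and a deeper descendant of the same reachable name."""
    live = set(roots) | reachable(graph, roots)
    for y in live:
        children = set(graph.get(y, ()))
        deeper = reachable(graph, children)  # names at distance two or more from y
        if children & deeper:
            return False
    return True


# {A -> a[B], B -> b[B?]}
RECURSIVE = {"A": {"B"}, "B": {"B"}}
# {A -> a[B|C], B -> b[], C -> c[B]}
AMBIGUOUS = {"A": {"B", "C"}, "B": set(), "C": {"B"}}
# {A -> a[B|C|E], B -> b[D], C -> c[], D -> d[], E -> b[]}
PLAIN = {"A": {"B", "C", "E"}, "B": {"D"}, "C": set(), "D": set(), "E": set()}
```

With this reading, `RECURSIVE` fails the non-recursiveness check, `AMBIGUOUS` is non-recursive but not parent-unambiguous (B is both a child of A and a child of C, itself a child of A), and `PLAIN` satisfies both conditions.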

Non-recursiveness and *-guardedness are properties enjoyed by a large number of commonly
used DTDs. As an example, the reader can consider the DTDs of the XML Query
Use Cases [13]: among the ten DTDs defined in the Use Cases, seven are both non-recursive
and *-guarded, one is only *-guarded, one is only non-recursive, and just one
does not satisfy either property. Furthermore, our personal experience is that most of the
DTDs available on the web are *-guarded. Concerning the parent-unambiguous property,
although DTDs satisfying this property are less frequent (five of the ten DTDs in [13]), its
absence is in practice not very problematic since, as we will see, only the presence of the
parent axis may hinder completeness in that case.

Before proving the completeness of type inference, we illustrate on simple examples
what happens when one of the conditions is not fulfilled. For *-guardedness, consider the
grammar

X → a[B|C],  B → b[],  C → c[]

rooted in X, together with the path:

child::node/self::b/parent::node/child::node

For the first two steps, our algorithm would determine the exact type and context:

$\Sigma^2$ = ({B → b[]}, {X → a[B|C], B → b[]})

For the parent step, the type and context are:

$\Sigma^3$ = ({X → a[B|C]}, {X → a[B|C]})

which are also exact. However, the last step induces the final type:

$\Sigma^4_{typ}$ = {B → b[], C → c[]}

and the context:

$\Sigma^4_{ctx}$ = {X → a[B|C], B → b[], C → c[]}

This is not exact because a document matching the first part, child::node/self::b, does
not have any "c" tag and therefore the rule for C in the output type is superfluous: this query
will never return a node with type C for a document of the considered type. Note that
the condition that unions in regular expressions must be guarded must also hold for rules,
namely there must not be two rules $Y \to l[r_1]$ and $Y \to l'[r_2]$ in the input type. Indeed,
these two rules behave like an unguarded union and therefore jeopardize completeness.
Local tree grammars forbid such rules and are thus an essential condition on the input type
for completeness to hold.

The recursiveness of the schema also interacts with the parent axis in a way that
prevents completeness of type inference. Consider the grammar:

{A → a[B], B → b[B?]}

and the path expression:

child::node/self::b/child::node/self::b/parent::node

Our type inference algorithm deduces on the second self::b step that the output type is
{B → b[B?]}. However, the last step, parent::node, is typed with the type

{A → a[B], B → b[B?]}

This is because in the grammar, A is a name reachable from B with a parent axis. However,
consider any document valid with respect to this grammar. Either it has only one b element,
in which case the result is empty, since we try to match two levels of b's with the query, or
it has at least two b's and then the output is always a b node (the topmost one). Therefore,
an a node is never part of the result, while the type A is returned by our algorithm.
Lastly, with the following parent-ambiguous grammar:

{A → a[B|C], B → b[], C → c[B]}

the algorithm fails to type exactly (but the output type is still sound) the query:

child::node/self::c/child::node/self::b/parent::node

By a similar reasoning, we can see that the algorithm returns the rules

{A → a[B|C], C → c[B]}

while only nodes with tag c can be returned by this query.

Intuitively, the reason why completeness does not hold in the three previous examples
is that there are chains in the grammar that may not reflect actual paths in a document.
For instance, in the last example, in a document "a[ b[] ]", the chain "ACB" has no
interpretation (since there are no c-nodes). In this case, there exists a valid document
which does not contain all the paths described by the possible chains in its type. Therefore,
the type inference algorithm will use chains and rules which are actually not part of the
interpretation of some documents of the type at issue. Fortunately, if a local tree grammar
is *-guarded, non-recursive, and parent-unambiguous, then there always exists a document
in which all the chains in the grammar are instantiated by some path. We call such a
document a witness of the grammar. We prove the existence of such a witness before
stating the completeness theorem.

Lemma 5.8 (Witness of a grammar) Let $(\mathcal{S}, E)$ be a non-recursive, *-guarded,
parent-unambiguous local tree grammar. There exists a document $t$, $\mathfrak{I}$-valid with respect to $(\mathcal{S}, E)$,
such that:

\[
\forall X \in \mathrm{Dn}(E),\ \exists i \in \mathrm{Ids}(t) \text{ such that } \mathfrak{I}(i) = X
\]

We call such a document a witness of the schema $(\mathcal{S}, E)$.
Corollary 5.9 Let $(\{X\}, E)$ be a non-recursive, *-guarded, parent-unambiguous local tree
grammar and $t$ be its witness. Let $\{Y_1, \ldots, Y_n\} \subseteq \mathrm{Dn}(E)$. If $Y_1 \Rightarrow_E \ldots \Rightarrow_E Y_n$, then there
exists $\{i_1, \ldots, i_n\} \subseteq \mathrm{Ids}(t)$ such that

\[
\forall j \in \{2 \ldots n\},\quad (i_{j-1}, i_j) \in \mathrm{Edg}(t) \wedge \mathfrak{I}(i_{j-1}) = Y_{j-1} \wedge \mathfrak{I}(i_j) = Y_j
\]

We are now equipped to state (and prove) the completeness theorem: 

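Theorem 5.10 (Completeness of type inference) Let $(\mathcal{S}, E)$ be a *-guarded, non-recursive
and parent-unambiguous local tree grammar, and $P$ a path. Let

\[
E_0 = \{\, X \to R \mid X \to R \in E,\ X \in \mathcal{S} \,\}.
\]

If $(E_0, E_0) \vdash_E P : (\tau, \kappa)$ then:

\[
\mathrm{Dn}(\tau) \subseteq \bigcup_{t \in_{\mathfrak{I}} (\mathcal{S}, E)} \mathfrak{I}([\![P]\!]_t(\mathrm{RootId}(t)))
\]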

One of the main reasons why completeness does not hold in general is that the
intersections operated by the type rule for parent are not powerful enough to guarantee
precision for recursive or parent-ambiguous grammars. In a nutshell, this happens because
in the presence of parent-ambiguous grammars the type analysis may produce contexts
containing false parent types (with respect to the current type $\tau$). This suggests that, in order to be
extremely precise, contexts should be sets of chains of names rather than sets of rules,
computed and opportunely managed by the type analysis. However, (i) managing sets of
chains instead of simple sets of rules dramatically complicates the treatment, due to the
interaction of recursive axes like descendant with recursive grammars; (ii) the problem may
arise only for queries that use the parent axis, and the additional requirement of parent-ambiguity
makes the event rare in practice; (iii) the loss of precision looks negligible in most cases; (iv) even
though it would be possible to obtain more precise results for a larger class of grammars, it
is well known that exact type inference for XPath routinely escapes regular tree languages
and therefore all existing formalisms to type XML: at some point, an approximation in
the type inference process is necessary to remain in the realm of regular types. Therefore
we considered that such a small gain (remember that completeness is just some icing on
the cake since, while it helps to gauge the precision of the approach, its absence does not
hinder its application) did not justify the dramatic increase in complexity needed to relax
the condition on the type for completeness to hold.

Of course, the completeness theorem is only stated for XPath$^\ell$ queries and does not
account for full XPath queries. Yet it illustrates how precise our type system is in the
best case. We will show on various examples that in less favorable cases, for schemas or
for XPath queries which need to be approximated, the type inference still remains very
precise.

5.2 Type-Projection inference 

In this section we use the type inference of the previous section to infer type-projectors.
Once more, naive solutions do not work. For instance, for simple paths
$\mathit{Step}_1/\ldots/\mathit{Step}_n$, we may consider as type projector with respect to
$(\mathcal{S},E)$ the set

\[ \bigcup_{i=1\ldots n} \tau_i \;\cup\; \{X_i \to R_i \mid X_i \in \mathcal{S}\} \]

where for $i = 1\ldots n$:

\[ \Sigma \vdash_E \mathit{Step}_1/\ldots/\mathit{Step}_i : (\tau_i, -) \]

with $\Sigma = (\{X_i \to R_i \mid X_i \in \mathcal{S}\}, \{X_i \to R_i \mid X_i \in \mathcal{S}\})$
(we use "-" as a placeholder for uninteresting parameters). This definition is sound but not
precise at all, as can be seen by considering the query descendant::node/Path: the use of
the above union yields a set containing $\tau_1$ defined as

\[ \Sigma \vdash_E \mathtt{descendant::node} : (\tau_1, -) \]

that is, all descendants of the start symbols in $\mathcal{S}$ (no pruning is performed).
Instead, we would like to discard, at least, all rules that are descendants of $\mathcal{S}$
but that are not ancestors of a node matching Path. These are the rules
$Y \to R \in T_E(A_E(\mathcal{S}, \mathtt{descendant}), \mathtt{node})$ such that

\[ (\{Y \to R\}, \kappa) \vdash_E \mathtt{descendant::node}/\mathit{Path} : (\emptyset, -) \]

for some appropriate context $\kappa$. A similar reasoning applies to ancestor.

As for the type inference, we define type-projector inference by a judgment and asso- 
ciated inference rules: 

Definition 5.11 (Type-projector inference) Let $(\mathcal{S},E)$ be a type, $P$ an
XPath$^\ell$ query in simple step normal form, and $\tau$ and $\kappa$ be subsets of $E$.
If the deduction system in Figure 2 proves the judgment

\[ (\tau, \kappa) \Vdash_E P : \pi \]

then the type-projector induced by $\pi$ is the grammar:

\[ (\mathcal{S} \cap Dn(\pi),\; \{(X \to R)\backslash_{Dn(\pi)} \mid X \to R \in \pi\}) \]

□ 

Obtaining a type projector from a set of rules returned by the judgment is straightforward.
In essence, the derivation collects in $\pi$ the rules of $E$ that are sufficient to answer
the query. Since in general not all rules in $E$ are kept, the rules in $\pi$ may use names
that are not defined in $\pi$. Therefore, the erasure operation (defined in Definition 2.10)
simply removes references to names not defined by any rule in $\pi$ (the definition of
$R\backslash_{\mathcal{Y}}$ is straightforward: it is $R$ where every occurrence of a name
not in $\mathcal{Y}$ is replaced by $\varepsilon$).

The rules in Figure 2 reflect the intuition we gave earlier. At each step, we execute the
type inference algorithm on the current set of rules and accumulate only those for which
the resulting type is not empty. Informally, each rule preserves the following properties:

well-formedness: if a rule $Y \to R$ is added to the type projector $\pi$, then there must
be a rule $X \to R' \in \kappa$ such that $Y \in Names(R')$.

precision: given a path $P$ and a rule $Y \to R$, if
$(\{Y \to R\}, -) \vdash_E P : (\emptyset, -)$ then $Y \to R$ must not be added to the
projector.

Let us explain how the different rules preserve these properties. The easiest case is that
of a query consisting of a single step, handled by Rule (p-step). In this rule, we just
apply the type inference algorithm to determine the output type of the results. The
resulting projector is the set of rules in the results, to which we add their upward
context $\kappa$, that is, the rules linking the results to a start symbol. The rules
(p-union) and (p-iterate) are only inductive cases, which allow us to handle, respectively,
top-level unions and projectors applied to a set of rules. In particular, (p-iterate) splits
the checking of a path over all the possible rules specified in the type component of the
environment (each one identifies a different set of current nodes). This allows us to define
the so-called "Path Rules" in a much simpler way, since they can be written for environments
in which the type component is just a singleton.

The Path Rules actually perform the projection, and they all follow the same scheme. Rule
(p-test) handles a simple node test. If the type inference returns some non-empty type
$\Sigma_{typ}$ for the step, then we can compute the projector for the continuation $P$ and
add to its result the rule for the current node. Rule (p-predicate) is similar: the type
$\Sigma_{typ}$ returned by the type inference is the set of nodes for which the predicate is
satisfied. We then recursively compute the projector for the continuation $P$ as well as for
the paths $P_{ij}$ occurring in the predicate. In the end, we return the union of all the
computed projectors, to which we add the rule for the current node. Again, we only do this
if the type inference returned a non-empty type.

The following rules handle the actual navigation. They are split into two sets, one for the
parent and child axes and another for their recursive variants, ancestor and descendant.
Since they are the most delicate rules, let us explain them in detail. The two cases are
similar. In Rule (p-single), the algorithm first retrieves all the rules matching the axis
(child or parent). These rules are collected in $\tau'$ and the analysis yields a current
context $\kappa'$. Then, by using $n$ calls to the type inference algorithm ($n$ being the
number of rules in $\tau'$), it collects among $\tau'$ only the rules that are a suitable
starting point for the rest of the path, that is, all the rules yielding a non-empty result
type when typed against $P$. These rules are collected in $\tau''$ which, as can easily be
seen, is a subset of $\tau'$. Finally, $\tau''$ and $\kappa'$ are used as the environment to
infer the projection with the rest of the path.

Rule (p-many) handles the recursive axes, descendant and ancestor. The rule is almost the
same as Rule (p-single), except that it does not test whether the continuation $P$ yields a
non-empty result on the node itself but on a descendant (or ancestor) of the node. This
ensures that we put in the projector not only the correct rules but also the rules leading
to them, and therefore that we maintain well-formedness. If for any of these rules one of
the side conditions does not hold, then Rule (p-erase) is applied and returns an empty
projector for the current path.

Before proving the formal properties of the type-projection inference, we illustrate its 
behavior by unrolling it on an example. Consider the grammar: 

{A -> a[(B|C)*], B -> b[D], C -> c[E], D -> d[E], E -> e[ ]}
with start symbol A and the path P:

descendant::node/self::e/ancestor::node/self::b
which is the single step normal form of

//e/ancestor::b

To ease the reading, we identify every rule with the non-terminal it defines. Therefore in
what follows when we write, say, A in types or contexts, we actually mean A -> a[(B|C)*].
The algorithm computes the type projector for P as follows. The initial environment is
({A}, {A}). We apply the rule (p-many) for the first step. The first premise computes
the type of descendant::node applied to A, which returns the type and context (these
instantiate the $(\tau', \kappa')$ of the rule):

({B,C,D,E}, {A,B,C,D,E})

Then the second premise filters out the unwanted names and keeps only those for which the
whole path may succeed. This gives us an intermediary type: {B,D,E} (and unchanged
context) onto which we can compute the projector for the path:

child::node/self::e/ancestor::node/self::b

The final result for this rule will be the projector for the above path, to which we add
{A} (i). At this point, since the input type contains many rules, we can apply the rule
(p-iterate), which will apply the continuation path on {B}, {C} and {D}. It is easy to see
that on {B} the side condition for the rule (p-single) is not fulfilled, since the type
inference returns empty. The rule (p-erase) applies and returns an empty projector. The
projection continues with only {C} and {D} left (the context is unchanged until now,
{A,B,C,D,E}). First let us consider the derivation for {D}. The current step is child::node,
which was introduced by the previous (p-many) rule. On this step, we apply the rule
(p-single). This rule adds {D} (ii) to the final projector and continues by computing a
projector from {E} using the path:

self::e/ancestor::node/self::b

When we apply the same rule to {C}, however, while the first premise returns a non-empty
type, the second one returns an empty result, since from a node with type C the path

child::node/self::e/ancestor::node/self::b

yields an empty result. Thus the rule is not applied, and the result of the projector for
the remaining path for the node type {C} is the empty projector. We continue with our only
remaining set, {E}. We compute the projector for self::e, which adds {E} to the final
projector (iii) and computes the projector for the path:

ancestor::node/self::b

It is easy to see that this returns {B} as a projector (iv). If we summarize, we obtain
from (i), (ii), (iii), and (iv) the set of rules

$\pi$ = {A -> a[(B|C)*], B -> b[D], D -> d[E], E -> e[ ]}

The actual type projector is:

\[ (\mathcal{S} \cap Dn(\pi),\; \{(X \to R)\backslash_{Dn(\pi)} \mid X \to R \in \pi\}) \]

that is:

({A}, {A -> a[B*], B -> b[D], D -> d[E], E -> e[ ]})
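The last step of this example, going from the collected rules to the actual type projector, can be sketched in Python as follows. This is only an illustration under stated assumptions: the tuple encoding of content models, the helper names `erase` and `type_projector`, and the small simplification step (which turns (B|ε)* into B*) are hypothetical and are not taken from the formal development.

```python
# Hypothetical encoding of content models as a small regex AST:
#   ("eps",) | ("name", X) | ("seq", r1, r2) | ("alt", r1, r2) | ("star", r)
EPS = ("eps",)

def erase(r, keep):
    """Replace every name outside `keep` by epsilon and simplify the result."""
    kind = r[0]
    if kind == "eps":
        return EPS
    if kind == "name":
        return r if r[1] in keep else EPS
    if kind == "star":
        inner = erase(r[1], keep)
        if inner[0] == "alt" and EPS in inner[1:]:
            # (x | eps)* denotes the same language as x*
            inner = inner[1] if inner[2] == EPS else inner[2]
        return EPS if inner == EPS else ("star", inner)
    left, right = erase(r[1], keep), erase(r[2], keep)
    if kind == "seq":
        if left == EPS:
            return right
        return left if right == EPS else ("seq", left, right)
    return left if left == right else ("alt", left, right)  # kind == "alt"

def type_projector(starts, pi):
    """Build (starts ∩ Dn(pi), erased rules) from the collected rule set `pi`."""
    defined = set(pi)  # Dn(pi): the names defined by some rule of pi
    rules = {x: (tag, erase(r, defined)) for x, (tag, r) in pi.items()}
    return starts & defined, rules

# The rule set collected in steps (i) to (iv) of the example; C is absent.
pi = {
    "A": ("a", ("star", ("alt", ("name", "B"), ("name", "C")))),
    "B": ("b", ("name", "D")),
    "D": ("d", ("name", "E")),
    "E": ("e", EPS),
}
starts, rules = type_projector({"A"}, pi)
print(rules["A"])  # the content model (B|C)* no longer mentions C
```

Since C is not defined in the collected set, its occurrence in the content model of A is erased, which is how a[(B|C)*] becomes a[B*] in the final projector.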

This example shows how the two properties of precision and well-formedness are pre- 
served: 

well-formedness: what we obtained at the end is a valid type without unneeded rules. 

precision: although the query references e nodes explicitly, we do not naively keep all the e 
nodes but only those that are useful to compute the query, namely those occurring 
below a b node. 

We can now present the formal properties of type-projection inference.

Lemma 5.12 (Termination of type-projector inference) Let $(\mathcal{S},E)$ be a type, $P$ a
path, and $\Sigma$ and $\Sigma'$ environments. The judgment $\Sigma \Vdash_E P : \Sigma'$ has
a unique and finite derivation.

The lemma above states that the rules in Figure 2 describe a terminating algorithm. We
show now that they compute a type-projector by formalizing the "well-formedness" property
that we outlined above. The intuition is that when the output type for a step is computed
(e.g., in the first premise of the rule (p-predicate)), then the context corresponding to
this computation is kept and passed as a parameter for the inference of the remainder of the
path. On the last step (rule (p-step)), the context is added to the type projector. There,
it ensures that whenever a rule $Y \to R$ is added to the type-projector, all the rules
needed to derive $Y \to R$ from the start symbols are added to the type-projector as well.
This is what we formally state by the following lemma:

Lemma 5.13 (Well-formedness of type-projector inference) Let $(\mathcal{S},E)$ be a type,
$\tau$, $\tau'$, and $\kappa$ sets of rules, and $P$ a path. If
$(\tau, \kappa) \Vdash_E P : \tau'$, then $(\tau, \kappa) \vdash_E P : (\tau'', \kappa'')$
implies $\kappa'' \subseteq \tau'$.

We can now state the soundness of type-projection inference: 
Theorem 5.14 (Soundness of type-projector inference) Let $(\mathcal{S},E)$ be a type and
$P$ an XPath$^\ell$ query. Let $S$ be the set of rules $S = \{X \to R \mid X \in \mathcal{S}\}$. If

\[ (S, S) \Vdash_E P : \tau \]

then $\tau$ is a type-projector for $(\mathcal{S},E)$ and for every tree $t$ valid with
respect to $(\mathcal{S},E)$ we have:

\[ [\![P]\!]_{t\backslash_{\tau}}(RootId(t)) = [\![P]\!]_{t}(RootId(t)) \]

In words, if $\tau$ is the projector inferred for a query $P$ and a grammar
$(\mathcal{S},E)$, then for every tree $t$ validating the grammar, the result of executing
$P$ on $t$ or on its pruned version $t\backslash_{\tau}$ is the same.
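The statement can be observed on the example grammar of this section with a few lines of Python. The sketch below is not the pruning algorithm of the paper: it assumes a local tree grammar, so that each tag is defined by exactly one rule and pruning by rule name amounts to pruning by tag; the sample document and the helper names `prune` and `b_ancestors_of_e` are hypothetical; and the query //e/ancestor::b is evaluated by hand because `xml.etree.ElementTree` does not support the ancestor axis.

```python
import xml.etree.ElementTree as ET

def prune(elem, kept_tags):
    """Drop, in place, every subtree rooted at an element whose tag is not kept."""
    for child in list(elem):
        if child.tag in kept_tags:
            prune(child, kept_tags)
        else:
            elem.remove(child)
    return elem

def b_ancestors_of_e(root):
    """Hand-written evaluation of //e/ancestor::b: the b elements with an e descendant."""
    return [b for b in root.iter("b") if b.find(".//e") is not None]

# A document valid for {A -> a[(B|C)*], B -> b[D], C -> c[E], D -> d[E], E -> e[]}.
doc = "<a><b><d><e/></d></b><c><e/></c><b><d><e/></d></b></a>"

original = ET.fromstring(doc)
# The inferred projector keeps the rules A, B, D and E, that is, the tags a, b, d and e.
pruned = prune(ET.fromstring(doc), {"a", "b", "d", "e"})

print(len(b_ancestors_of_e(original)), len(b_ancestors_of_e(pruned)))
print([el.tag for el in pruned.iter()])
```

The c subtree, together with the e node below it, disappears from the pruned document, while the number of b nodes selected by the query is unchanged.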

Completeness requires not only completeness of the type system (*-guarded, non-recursive,
and parent-unambiguous DTDs), but also the following condition on queries:

Definition 5.15 An XPath query Q is strongly-specified if:

i. its predicates do not use backward axes,

ii. along Q and along each path in the predicates of Q there are no two consecutive
(possibly conditional) steps whose Test part is node,

iii. each predicate in Q contains at most one path and this does not terminate by a step
whose Test is node. □

For instance, among the following queries, only the first two are strongly-specified:

- descendant::node/self::a/ancestor::node

- descendant::node[child::b]/self::a/parent::node

- descendant::node/ancestor::node/self::a

- descendant::node[child::b/child::node]/self::a

- child::a[descendant::node/parent::b/child::c]

In the third query, there are two consecutive steps with a "node" test, which violates
condition (ii). In the fourth query the predicate contains a path ending with "node", which
fails to satisfy condition (iii). In the last query, the predicate contains backward axes,
which violates condition (i).
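The three conditions are purely syntactic, so they can be checked by a simple traversal of the query. The following Python sketch illustrates this on the five queries above. The `Step` representation (one list of paths per predicate) and the function names are hypothetical choices made for this illustration; they are not part of the formal development.

```python
from dataclasses import dataclass, field

BACKWARD = {"parent", "ancestor"}

@dataclass
class Step:
    axis: str
    test: str
    # One entry per predicate; each predicate is the list of paths it contains.
    preds: list = field(default_factory=list)

def all_steps(path):
    """Yield every step of `path`, including the steps nested in predicates."""
    for step in path:
        yield step
        for pred in step.preds:
            for inner in pred:
                yield from all_steps(inner)

def no_consecutive_node(path):
    """Condition (ii) for `path` and, recursively, for the paths in its predicates."""
    flat = all(not (s.test == "node" and t.test == "node")
               for s, t in zip(path, path[1:]))
    nested = all(no_consecutive_node(p)
                 for s in path for pred in s.preds for p in pred)
    return flat and nested

def strongly_specified(path):
    pred_paths = [(pred, p) for s in all_steps(path) for pred in s.preds for p in pred]
    cond_i = all(st.axis not in BACKWARD for _, p in pred_paths for st in all_steps(p))
    cond_ii = no_consecutive_node(path)
    cond_iii = all(len(pred) <= 1 and p[-1].test != "node" for pred, p in pred_paths)
    return cond_i and cond_ii and cond_iii

S = Step
queries = [
    [S("descendant", "node"), S("self", "a"), S("ancestor", "node")],
    [S("descendant", "node", [[[S("child", "b")]]]), S("self", "a"), S("parent", "node")],
    [S("descendant", "node"), S("ancestor", "node"), S("self", "a")],
    [S("descendant", "node", [[[S("child", "b"), S("child", "node")]]]), S("self", "a")],
    [S("child", "a", [[[S("descendant", "node"), S("parent", "b"), S("child", "c")]]])],
]
print([strongly_specified(q) for q in queries])
```

A predicate such as [child::b or child::c] would be encoded as a single predicate holding two paths, and is therefore rejected by the check for condition (iii).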

Once more, we are in the presence of a very common class of queries: for instance, almost
all paths in the XMark and XPathMark benchmarks are strongly specified.

If all the conditions are met, then we can show that our algorithm is complete, in the
sense that it infers the best possible sound projector. In words, if we remove any rule
(and its consequences) from a projector inferred for a path $P$ and a grammar
$(\mathcal{S},E)$, then we obtain a projector for which there exists a tree $t$ validating
the grammar for which the executions on $t$ and on its pruned version yield different
results. Formally:

Theorem 5.16 (Completeness of projector inference) Let $(\mathcal{S},E)$ be a *-guarded,
non-recursive, and parent-unambiguous local tree grammar, and $P$ a strongly-specified
XPath$^\ell$ path. Let $S$ be the set of rules $S = \{X \to R \mid X \in \mathcal{S}\}$. If

\[ (S, S) \Vdash_E P : \tau \]

then there exists a tree $t$ valid with respect to $(\mathcal{S},E)$ such that for each
$Y \to R \in \tau$, if
$\pi = \tau \setminus (\{Y \to R\} \cup A_E(\{Y \to R\}, \mathtt{descendant}))$, then:

\[ [\![P]\!]_{t\backslash_{\pi}}(RootId(t)) \neq [\![P]\!]_{t}(RootId(t)) \]
The fact that completeness may not hold for local tree grammars that are not *-guarded,
that are recursive, or that are parent-ambiguous is a consequence of the analogous property
of the type system. To see that strong specification is also a necessary condition, consider
documents valid with respect to the following grammar rooted at X:

{X -> a[Y,W], W -> c[], Y -> b[Z], Z -> d[]}

If we query a document of that type with the following non strongly-specified query (it
does not satisfy (iii))

self::a[child::node],

then {X,Y} is an optimal projector for this query (once more, we use a name to denote the
rule that defines it), but the presence of the condition child::node forces the system to
include also W in the inferred projector, thus breaking completeness. A similar reasoning
applies to self::a[child::b or child::c], which does not satisfy condition (iii) because
of the presence of multiple paths in the predicate. Concerning the presence of backward axes
in predicates, consider the query

self::a[descendant::node/ancestor::a]

which does not satisfy condition (i). An optimal projector for this query on the same
grammar is {X,Y}. However, since the ancestor condition is true for all descendants of a
nodes, {W,Z} is included in the projector as well. Finally, with a similar reasoning on the
same grammar, it is clear that the query

descendant::node/ancestor::node/self::a

for which condition (ii) does not hold, jeopardises completeness. The first step selects all
the rules in the grammar that can be derived from the start symbol (that is, all the rules).
None of these rules is discarded by the projector inference, since for none of them the
output type of

ancestor::node/self::a

is empty. The point here is that for the given grammar, there is no need to keep all the
nodes, but only one child of the root. Indeed, having one element below the root guarantees
that the sequence descendant::node, ancestor::node is not empty and therefore that
the root can be selected.

Of course, it is possible to state completeness for other classes of queries but, once 
more, this seems a satisfactory compromise between simplicity and generality. 

6 Extension to full XPath 

The formal developments of the previous section only deal with the XPath$^\ell$ language.
This language allows one to express structural queries, that is, queries whose predicates
contain only conjunctions or disjunctions of paths. In this section we show how to translate
a (full) XPath query into a (set of) XPath$^\ell$ queries and perform type-projection
inference for the latter that is sound for the former. In other terms, we show that our
translation is a sound approximation with respect to type-projection. Finally, we also show
how to encode the XPath axes not present in XPath$^\ell$ and how to extend our theoretical
framework to handle most XML and XPath peculiarities (attributes, absolute paths, ...).

6.1 Handling XPath predicates 

We extend Definition 4.1 to XPath 1.0 paths ([35]):

Definition 6.1 A path is a finite production of the following grammar:

Path  ::= Step | Path/Path | Path|Path
Step  ::= Axis::Test | Axis::Test[Cond]
Axis  ::= self | child | descendant | parent | ancestor
Test  ::= tag | node | text
Cond  ::= Cond or Cond | Cond and Cond | Expr
Expr  ::= Expr cmp Expr | Arith
Arith ::= Arith op Arith | Atom
Atom  ::= f(Expr, ..., Expr) | Path | v

where:

tag ranges over element tags,
cmp $\in$ {=, !=, <=, <, >, >=},
op $\in$ {+, -, *, div, mod},
f ranges over a set of built-in functions of the Core Function Library,
v ranges over values: strings, sequences, integers, ...

□

We wish to provide a safe translation from an XPath query Q to an XPath$^\ell$ query P that
approximates Q and use it to infer a type projector. By safe we mean that the type-projector
inferred for P must not change the semantics of Q.

What exactly is an approximating query in this context? A naive approach to define
query approximation is to consider inclusion of the results. According to it, the query P
that translates Q should always select at least the nodes selected by Q. However, this works
only as long as we do not use non-structural conditions (that is, predicates that make a
query non-structural). This is clear, for example, when we use the negation function not.
Consider the query

descendant::a[not(child::b)]

For all documents, the query descendant::a returns at least the results of the query above.
However, a projector inferred for descendant::a would discard b nodes that do not occur
above an a node, and therefore possibly also some b nodes that are children of an a node. In
this way it would change the result of the original query. What the approximating query
needs to reflect cannot be defined in terms of inclusion of results but rather in terms of
data-need. We must ensure that the approximation traverses at least the same nodes as the
original query, so that the nodes the original query needs are not pruned. However, we also
want the approximation to be as precise as possible. For instance, "descendant::node" is a
sound approximation for any XPath query, but the projector we infer from it is utterly
imprecise: it performs no pruning.

As the reader will have understood, the tricky part is to approximate non-structural 
conditions. We do it as follows: 

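Definition 6.2 (Approximation of a path) Let $P$ and $S$ respectively denote a path and a
set of paths of XPath$^\ell$. Let $P/S$ denote the set of XPath$^\ell$ paths defined as
$\bigcup_{P' \in S}\{P/P'\}$. Given an XPath expression $Q$, its approximation $P(Q)$ is the
set of XPath$^\ell$ paths defined as:

\[
\begin{array}{lcl}
P(Q_1 \mid Q_2) & = & P(Q_1) \cup P(Q_2) \\
P(\mathit{Axis}::\mathit{Test}/Q) & = & \mathit{Axis}::\mathit{Test}/P(Q) \\
P(\mathit{Axis}::\mathit{Test}[C]/Q) & = & \mathit{Axis}::\mathit{Test}[C(C)]/P(Q) \;\cup\; \mathit{Axis}::\mathit{Test}/S(C)
\end{array}
\]

where:

\[
\begin{array}{lcll}
C(P) & = & P & \text{if } P(P) = \{P\} \\
C(P) & = & \mathtt{self::node} & \text{if } P(P) \neq \{P\} \\
C(C_1 \text{ or } C_2) & = & C(C_1) \text{ or } C(C_2) & \\
C(C_1 \text{ and } C_2) & = & C(C_1) \text{ and } C(C_2) & \\
C(C) & = & \mathtt{self::node} & \text{otherwise}
\end{array}
\]

and:

\[
\begin{array}{lcll}
S(P) & = & \emptyset & \text{if } P(P) = \{P\} \\
S(P) & = & P(P) & \text{if } P(P) \neq \{P\} \\
S(C_1 \text{ or } C_2) & = & S(C_1) \cup S(C_2) & \\
S(C_1 \text{ and } C_2) & = & S(C_1) \cup S(C_2) & \\
S(C_1 \; op \; C_2) & = & S(C_1) \cup S(C_2) & \\
S(C_1 \; cmp \; C_2) & = & S(C_1) \cup S(C_2) & \\
S(f(C_1, \ldots, C_n)) & = & F(f(C_1, \ldots, C_n)) &
\end{array}
\]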



The most technical point in the definition above is, as expected, the approximation of
conditions, implemented by the auxiliary functions C() and S(). To be precise, we
differentiate between purely structural paths and non-structural paths. For a structural
path, P() does not introduce any approximation and returns the singleton containing the path
itself. Otherwise, a non-structural path is approximated by a set of paths. The translation
is non-trivial when the path contains non-structural conditions. Let us first illustrate the
rationale of the definition by an example. The path

descendant::a[(count(child::b)>3 and child::c) or descendant::b]/child::d

is approximated by the following set of two paths

{ descendant::a[self::node and child::c or descendant::b]/child::d,

descendant::a/child::b }

The first is generated by an application of the function C(), while the second derives from
the application of S(). As we see, the expression count(child::b)>3 is approximated by the
function C() into the self::node path occurring in the first path of the set. This condition
is always true and therefore it is a sound approximation of the Boolean value of the
expression (since the result is always true, the type-inference algorithm will never be
able to deduce an empty output type for this sub-path, and therefore the type-projector
inference algorithm will keep the rules associated with this node). However, this is not
sufficient to ensure the safety of type projection. Indeed, for this test to be possible
at run-time, the projected document must have the "b" nodes that were below a nodes in
the original document. This approximation is made via the second path by the S() function
and, in particular, by F(count(child::b)). Of course, what F() actually does depends on the
semantics of the built-in function. For instance, count(P) returns the number of nodes
selected by P, thus a projector keeping the type of the nodes selected by P is sound. On the
contrary, the function string(P), when applied to a node set, returns the concatenation of
the string-values of all the nodes in the set. The string-value of a node is the
concatenation of all the PCDATA elements occurring below it. Therefore a suitable
approximation for string(P) is not P but rather P/descendant::text. Giving an approximation
for all the functions of the XPath Core Library is a tedious task. Although our prototype
implements approximations for all functions, in Figure 3 we just give an excerpt that covers
all the different techniques we used in our prototype to approximate built-in functions.



and F(_) is the approximation of built-in functions (see Figure 3 for an excerpt).



□ 
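The approximation can be prototyped directly from the equations of Definition 6.2. The Python sketch below is only an illustration under several assumptions that are not in the text: paths are encoded as tuples of (axis, test, condition) steps, top-level unions are left out, S() is taken to be empty on structural paths and on values, and `approx_fun` covers only the two built-in functions discussed above (count and string), since Figure 3 is not reproduced here. All function names are hypothetical.

```python
# Conditions: ("path", p) | ("or", c1, c2) | ("and", c1, c2) | ("cmp", sym, e1, e2)
#             | ("op", sym, e1, e2) | ("fun", name, args) | ("val", v)
SELF_NODE = ("path", (("self", "node", None),))

def approx(path):
    """P(Q): set of structural paths approximating `path` (a tuple of steps)."""
    if not path:
        return {()}
    (axis, test, cond), rest = path[0], path[1:]
    tails = approx(rest)
    if cond is None:
        return {((axis, test, None),) + t for t in tails}
    main = {((axis, test, cond_c(cond)),) + t for t in tails}
    side = {((axis, test, None),) + s for s in cond_s(cond)}
    return main | side

def cond_c(c):
    """C(C): structural over-approximation of the truth value of a condition."""
    if c[0] == "path":
        return c if approx(c[1]) == {c[1]} else SELF_NODE
    if c[0] in ("or", "and"):
        return (c[0], cond_c(c[1]), cond_c(c[2]))
    return SELF_NODE

def cond_s(c):
    """S(C): extra paths whose nodes the condition needs at run time."""
    if c[0] == "path":
        paths = approx(c[1])
        return set() if paths == {c[1]} else paths
    if c[0] in ("or", "and", "cmp", "op"):
        return cond_s(c[-2]) | cond_s(c[-1])
    if c[0] == "fun":
        return approx_fun(c[1], c[2])
    return set()  # values: assumed to need no data

def approx_fun(name, args):
    """F(f(...)): only the two cases discussed in the text, for one path argument."""
    if len(args) != 1 or args[0][0] != "path":
        raise NotImplementedError(name)
    p = args[0][1]
    if name == "count":
        return approx(p)
    if name == "string":
        return approx(p + (("descendant", "text", None),))
    raise NotImplementedError(name)

def show(path):
    def cond(c):
        return show(c[1]) if c[0] == "path" else f"({cond(c[1])} {c[0]} {cond(c[2])})"
    return "/".join(f"{a}::{t}" + (f"[{cond(c)}]" if c else "") for a, t, c in path)

def path1(axis, test):
    return ("path", ((axis, test, None),))

count_b = ("cmp", ">", ("fun", "count", (path1("child", "b"),)), ("val", 3))
condition = ("or", ("and", count_b, path1("child", "c")), path1("descendant", "b"))
query = (("descendant", "a", condition), ("child", "d", None))
for p in sorted(show(p) for p in approx(query)):
    print(p)
```

On the example query of this section the sketch yields the same two paths as above, with the Boolean structure of the condition printed fully parenthesised.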
6.2 Other XPath features



We purposely left out from our definitions some features of XPath that would have led to 
a much more intricate formalization process, in particular for what concerns definitions of 
the algorithms and the proofs of the theorems. Here we illustrate how these features can be 
either encoded or approximated within our framework. 

6.2.1 descendant-or-self and ancestor-or-self axes 

These axes (which we used in Figure 3) can be encoded exactly by using the "|" operator.
Precisely,

P/descendant-or-self::Test[Cond]/P'
can be equivalently written as

P/(descendant::Test[Cond] | self::Test[Cond])/P'

6.2.2 Sibling axes 

We could have defined a sibling relation over node identifiers in the same way as we defined 
the edge relation Edg in Section|U and used it to deal with the following-sibling and 
preceding-sibling axes natively. However we can also approximate these axes using 
only "vertical" moves. So for instance 

P/following-sibling::Test[Cond]/P'

becomes:

P/parent::node/child::Test[Cond]/P'

The transformation above approximates the following siblings of a node by all its siblings, 
including itself. Our experiments showed that, as far as type-projection is concerned, this 
kind of approximation does not yield any noticeable loss of precision in practice. 

6.2.3 preceding and following axes 

For these axes, we can directly use the W3C recommendation [35] and encode following
accordingly. That is,

P/following::Test[Cond]/P'

becomes

P/ancestor::node/following-sibling::node/descendant-or-self::Test[Cond]/P'
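The three encodings of Sections 6.2.1 to 6.2.3 can be combined into a small step-rewriting function. The Python sketch below works on strings and is only an illustration: the function name `encode_step` is hypothetical, the encoding of following is obtained by composing the sibling and descendant-or-self encodings exactly as they are given above, and axes for which the text gives no explicit encoding (such as preceding) are rejected rather than guessed.

```python
CORE_AXES = {"self", "child", "descendant", "parent", "ancestor"}

def encode_step(axis, test, cond=None):
    """Rewrite one step into a fragment that uses only the five core axes and '|'."""
    pred = f"[{cond}]" if cond else ""
    if axis in CORE_AXES:
        return f"{axis}::{test}{pred}"
    if axis in ("descendant-or-self", "ancestor-or-self"):
        # Section 6.2.1: exact encoding with a union.
        base = axis[: -len("-or-self")]
        return f"({base}::{test}{pred} | self::{test}{pred})"
    if axis in ("following-sibling", "preceding-sibling"):
        # Section 6.2.2: approximate the siblings on one side by all siblings.
        return f"parent::node/child::{test}{pred}"
    if axis == "following":
        # Section 6.2.3: composition of the two encodings above.
        tail = encode_step("descendant-or-self", test, cond)
        return f"ancestor::node/{encode_step('following-sibling', 'node')}/{tail}"
    raise ValueError(f"no encoding sketched for axis {axis!r}")

print(encode_step("descendant-or-self", "a", "child::b"))
print(encode_step("following-sibling", "a"))
print(encode_step("following", "a"))
```

Only the first encoding is exact; the other two over-approximate the selected nodes, which is safe for projection since it can only keep more of the document.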

6.2.4 Document node 

The XPath data model enforces the presence of a document node, the real root of the doc- 
ument which has no label and is selected by the initial "/" of an XPath expression. It is 
of course possible to represent such documents in our framework but we preferred to omit 
it here since it would cause many presentation issues with little theoretical interest. In 
particular, the document node is never referenced by the schema of the document. 

6.2.5 Absolute paths 

Absolute paths are paths with a leading /. They do not start their evaluation from the
current context node but from the root of the document. Our formalism easily allows us to
encode absolute paths. First, if an absolute path occurs outside of a predicate, as in:

$P/(P_1 \mid /P_2)/P'$
then we can simply rewrite it as:

$(P/P_1/P') \mid (P_2/P')$

Second, if the path $/P$ occurs in a predicate, then we can replace it with self::node
(as if it were a non-structural condition) and add $P(/P)$ to the global approximation.
Direct treatment of absolute paths would have further complicated Definition 6.2, where we
would have had to maintain a set of absolute approximations, modified only by absolute paths
and propagated at each function call. We chose not to clutter this definition (but absolute
paths are handled by our implementation).

6.2.6 attribute axis and attributes in the data-model and schema 

Conceptually, the attribute axis is not very different from the child axis, and could be
encoded as such. For instance a possible solution would be to encode an element

<e att="value" id="34"><a/><b/></e>

as the tree:

e[ @[att["value"] id["34"]] a[] b[] ]

by introducing a phony node with label @. If such a solution were retained then we would
also need to update the definitions of child and descendant to ignore @ nodes, and add
an attribute axis selecting only the content of such nodes.
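The encoding of attributes under a phony @ node is easy to reproduce with the standard library. The following Python sketch is only an illustration: the function name `encode` and the textual output format are hypothetical, and text content is ignored for brevity.

```python
import xml.etree.ElementTree as ET

def encode(elem):
    """Render an element as tag[...] with its attributes gathered under a phony '@' child."""
    parts = []
    if elem.attrib:
        attrs = " ".join(f'{name}["{value}"]' for name, value in elem.attrib.items())
        parts.append(f"@[{attrs}]")
    parts.extend(encode(child) for child in elem)
    return f"{elem.tag}[{' '.join(parts)}]"

print(encode(ET.fromstring('<e att="value" id="34"><a/><b/></e>')))
```

Elements without attributes get no @ child in this sketch, so a[] and b[] are rendered as in the tree above.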

As far as schemas are concerned, they need to reflect the uniqueness and unorderedness
of a sequence of attributes within an element node. This can be done with a union type.
For instance, the document above could have type:

E    -> e[ATTS A B]
ATTS -> @[(ATT ID) | (ID ATT)]
ATT  -> att[String]
ID   -> id[String]
A    -> a[]
B    -> b[]

This encoding, however, incurs an exponential blow-up in the size of the sequence of
attributes, since every possible ordering must be listed. Our implementation follows a much
more pragmatic approach. Precisely, even though attributes could be encoded in our approach,
we preferred to add an unordered attribute construct directly at the grammar level and
specialize the type-inference and type-projector inference rules for attributes.
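The size of the union type can be made concrete with a short Python sketch. It assumes, as in the ATTS rule above, that the content model lists one alternative per ordering of the attribute names; the function name `atts_content_model` is hypothetical.

```python
from itertools import permutations
from math import factorial

def atts_content_model(names):
    """List one alternative per ordering of the attribute names."""
    return " | ".join("(" + " ".join(p) + ")" for p in permutations(names))

print(atts_content_model(["ATT", "ID"]))
# Number of alternatives for 1 to 6 attributes: n! orderings for n attributes.
print([factorial(n) for n in range(1, 7)])
```

With two attributes the result is the content model of ATTS given above; the number of alternatives grows as n!, which is why an unordered construct at the grammar level is more practical.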

6.2.7 id() function 

The id() function of XPath is peculiar in the sense that, unlike other functions, it does
not take the context node as implicit argument (e.g. the position() function returns the
position of the context node within the current result set). Rather, the expression
"id("foo")" returns the node whose id is "foo" if it exists (a node has id "foo" if it has
an attribute named id whose value is "foo" and if this attribute has been declared with type
ID in the Schema, [35]). We choose to approximate this function in two steps. First, we
rewrite it as an absolute path. Then we can let our approximation algorithm handle the
absolute path (with the technique described in Section 6.2.5). For instance an expression
such as

id("item34")/child::name

can be rewritten as

/descendant::*[@id="item34"]/child::name

This rewrite technique was used in particular to handle queries C5-C7 of the XPathMark
benchmark (see Section 9).
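The first step of this rewriting is a purely textual transformation. The Python sketch below illustrates it with a regular expression. It rests on assumptions that go beyond the text: the argument of id() must be a string literal, the ID attribute is assumed to be named id as in the example above, and the function name `rewrite_id` is hypothetical.

```python
import re

# Matches id("k") or id('k') with a literal argument (an assumption of this sketch).
_ID_CALL = re.compile(r'''\bid\s*\(\s*(["'])(.*?)\1\s*\)''')

def rewrite_id(expr):
    """Replace each id("k") call by the absolute path /descendant::*[@id="k"]."""
    def as_path(match):
        quote, key = match.group(1), match.group(2)
        return f"/descendant::*[@id={quote}{key}{quote}]"
    return _ID_CALL.sub(as_path, expr)

print(rewrite_id('id("item34")/child::name'))
```

The resulting absolute path can then be handled as described in Section 6.2.5.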
7 Extension to XQuery

In this section we extend our technique to XQuery.

Definition 7.1 (XQuery)

FLOWR  ::= FORLET return ExprS | ExprS
FORLET ::= FOR | LET
FOR    ::= for $x in ExprS
LET    ::= let $x := ExprS
ExprS  ::= if ExprS then Cond else Cond | Cond
Cond   ::= Cond or Cond | Cond and Cond | Expr
Expr   ::= Expr cmp Expr | Arith
Arith  ::= Arith op Arith | Atom
Atom   ::= f(Expr, ..., Expr) | FLOWR/P | x | v
         | FLOWR, FLOWR | <tag>FLOWR</tag> | ()
P      ::= Step[Cond]/P | Step/P | Path

where x ranges over identifier names, v ranges over values (such as integers and strings),
cmp ranges over {=, !=, <, >, >=, <=}, op ranges over {+, -, *, div, mod}, and Path and
Step are the same as in Definition 6.1, that is, they denote step and path expressions free
of any XQuery construct.



For the sake of clarity and concision we only considered formally a subset of the XQuery
grammar ([36]). In a nutshell, the definition of Atom (given in Section 6) is extended with
two new constructs: variables (ranged over by x, y, z in what follows) and path applications
FLOWR/P.

Note that XQuery constructs may occur inside a path expression (production P) or
not (production Path). Also, we consider neither queries that first construct new elements
and then navigate on them (these are rarely used in practice) nor queries containing
"order by", "switch case", etc. constructs. XQuery queries are ranged over by q. In
order to apply the previous analysis to infer a projector for an XQuery query q, we first
extract a set of full XPath expressions from q, denoting the data needs of q. Then, we apply
to each of these extracted paths the approximation function P(_) given in Definition 6.2 to
obtain a set of XPath$^\ell$ expressions. We can finally use the projector inference
algorithm of Section 5.2 on the set of approximated paths, which yields a sound type
projector for the original XQuery query q.

Path extraction is performed by the extraction function E(_,_,_), whose definition is
given in Figure 4. The extraction function has the form E(q, $\Gamma$, m) and performs a
straightforward recursive descent over its first parameter q, which is the query at issue.
The second parameter $\Gamma$ is an environment that keeps track of the bindings of the form
(x;P) in whose scope q occurs. Finally, m is a flag indicating whether q is a query that
serves to materialise the full content of the queried elements (m = 1) or whether the query
just selects a set of nodes whose descendants are not needed (m = 0). Before explaining in
detail the rules in Figure 4, we introduce two auxiliary functions. The first one is M(_,_)
(Figure 5) which, given a built-in XPath function and the position of one of its arguments,
returns a suitable value for the parameter m (intuitively, M(f, i) returns 1 if f needs the
full content of its i-th argument and 0 otherwise). This function is similar to the function
F(_) introduced in Section 6 (Figure 3).

The second one, E'(_,_,_), is defined mutually with E(_,_,_); it recursively traverses
XQuery expressions and resolves the variable names they contain. It works similarly to
E(_,_,_), but it does not return sets of XPath paths: it returns sets of particular XQuery
expressions which do not contain any variables.



Now that we have introduced environments and the auxiliary functions, we can easily
describe the rules in Figure 4. First, Rules 1 and 2 form the base case of the recursive
descent and return the empty set if the whole query consists of a constant. Rules 3 and
4 straightforwardly apply the extraction recursively to the content of sequence (Rule 3)
and element (Rule 4) constructors. Rules 5 and 6 handle the case of a variable bound in the
environment. Rules 7 and 8 add a constant path to the set of extracted paths, according to the
value of the parameter m. Note that in those rules, Path refers to the corresponding entry
in the grammar of Definition 16, that is, it does not contain any XQuery construct and only
pure XPath ones. Paths containing XQuery expressions are handled in the subsequent rules.
Rule 9 handles the composition of a FLWOR expression with a path P. Note that, as
previously, the notation S1/S2, where S1 and S2 are sets of paths, stands for:

\[ \bigcup_{P \in S_1,\ P' \in S_2} \{P/P'\} \]

The case of a simple step composed with a path expression is handled similarly by Rule 10;
we recall that the notation Step/S, where S is a set of paths, is syntactic sugar for the set:

\[ \bigcup_{P \in S} \{\mathit{Step}/P\} \]
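
As a purely illustrative sketch (not part of the formal development), the two notations can be
read as operations on sets of path strings. The function names compose and prefix_step, and
the encoding of paths as plain strings, are assumptions made for this example only.

```python
def compose(s1, s2):
    """S1/S2: every path of s1 followed by every path of s2."""
    return {f"{p}/{q}" for p in s1 for q in s2}


def prefix_step(step, s):
    """Step/S: the given step followed by every path of s."""
    return {f"{step}/{p}" for p in s}
```

Both operations return the empty set as soon as one of the argument sets is empty, which
matches the set-theoretic reading of the two unions.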

Rule 11 is more intricate, but its complexity is only bureaucratic. This rule, which did not
exist in the previous version of our work ([5]), allows the extraction process to extract full
XPath expressions. In the present work, path extraction and path approximation are two
separate processes. Path extraction only occurs at the level of XQuery terms and returns
sets of full XPath expressions (which reflect the exact paths that may be evaluated during
query execution). Approximation from XPath to XPath^ℓ is handled at the XPath level. The
issue solved by Rule 11 is to recursively traverse an XQuery expression, using a recursive
call to the auxiliary function E'(_,_,_), which builds a set of XPath conditions in which
all variable bindings have been resolved. Therefore what we obtain after E'(_,_,_) is a
set of XPath conditions free of any XQuery construct (especially variables). We can now
explain how E'(_,_,_) works. Rules 1' to 3' use a recursive descent into the productions
of the XQuery grammar, starting at the condition level, and reconstruct Boolean XPath
conditions (1'), relational XPath expressions (2') or arithmetic XPath expressions (3'). More
interesting is Rule 4', which traverses the arguments of a function call and uses the auxiliary
function M(_,_) to determine a suitable value for m. Lastly, if the input matches any other
construct, Rule 5' applies and recursively calls E(_,_,_) to construct a set of XPath paths.

We can resume our description of E(_,_,_) for the remaining cases, the high-level
constructs "if then else", "let return" and "for return", handled by Rules 12, 13 and
14 respectively. Rule 12 recursively extracts paths from the Boolean test q, the "then" branch
q1 and the "else" branch q2. The only point of interest is that the Boolean test cannot
generate a result and therefore can be called with parameter m = 0. The let binding handled
by Rule 13 augments the environment F with the paths extracted from q1 and extracts the
paths of query q2 in this augmented environment. Note that the paths bound to x are added
to the final result by Rule 5 or 6 only if the variable is used. On the contrary, in Rule 14,
for loops perform their iterations even if the bound variable is never used, as long as
the paths extracted from q1 yield a non-empty result. It is therefore mandatory to add the
paths extracted from q1 to the final result.
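
The difference between the let and for cases can be made concrete with a small sketch. It is
not a transcription of the rules of Figure 4: the miniature query syntax (Const, Var, Path, If,
Let, For), the function names, and the choice of modelling m = 1 by appending a
descendant-or-self::node step are all assumptions made for illustration, and paths rooted
at variables (Rules 9 to 11) are left out.

```python
from dataclasses import dataclass

DOS = "descendant-or-self::node"


@dataclass(frozen=True)
class Const:
    value: object


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Path:
    path: str  # a pure XPath path, without XQuery constructs


@dataclass(frozen=True)
class If:
    cond: object
    then: object
    orelse: object


@dataclass(frozen=True)
class Let:
    var: str
    bound: object
    body: object


@dataclass(frozen=True)
class For:
    var: str
    bound: object
    body: object


def materialise(paths, m):
    """Assumption: m = 1 is modelled by asking for all descendants."""
    return {f"{p}/{DOS}" for p in paths} if m == 1 else set(paths)


def extract(q, env, m):
    """Hypothetical extraction over the miniature syntax above."""
    if isinstance(q, Const):
        return set()                      # constants contribute no path
    if isinstance(q, Var):
        return materialise(env[q.name], m)  # paths bound to the variable
    if isinstance(q, Path):
        return materialise({q.path}, m)
    if isinstance(q, If):
        # The test never produces a result, hence m = 0 for it.
        return (extract(q.cond, env, 0)
                | extract(q.then, env, m)
                | extract(q.orelse, env, m))
    if isinstance(q, (Let, For)):
        bound = extract(q.bound, env, 0)
        inner = extract(q.body, {**env, q.var: bound}, m)
        # A for loop iterates even when its variable is unused, so its
        # bound paths are always kept; a let only contributes through Var.
        return (bound | inner) if isinstance(q, For) else inner
    raise TypeError(f"unsupported query: {q!r}")
```

With these definitions, a let whose variable is never used contributes no path, whereas the
corresponding for still contributes the paths of its bound expression.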

These rules subsume and enhance the technique of Marian and Simeon [27]. In
particular, (i) the technique we use to exclude useless intermediate paths is simpler and more
compact, (ii) we do not need to distinguish between two kinds of extracted paths but, more
simply, we always manage a unique set of path expressions, and last but not least, (iii)
our path extractor can be used even if the user cannot access an XQuery to XQuery-Core
compiler, which is necessary for [27].

Before applying the extraction function E(_,_,_) to some query q, we apply some
heuristics that rewrite q so as to improve the pruning capability of the inferred paths. Among
these heuristics the most important is the one that rewrites

for y in Q/descendant-or-self::node
return if C(y) then q else ()

into

for y in Q/descendant-or-self::node[C(self::node)]
return q

whenever C(y) is a condition that refers only to y and does not use external functions
(C(self::node) is obtained by replacing all free occurrences of y in C with self::node). If we apply
E(_,_,_) to the first query, then a path ending with the step descendant-or-self::node is
extracted, thus annulling further pruning: the entire forest selected by Q is loaded in main
memory. This also happens with the approaches of Bressan et al. [13] and of Marian
and Simeon [27]. In our approach and in Marian and Simeon's, the query can be rewritten
as above, while this is not possible with the formalism of Bressan et al., since their subset of
XQuery does not include predicates. However, Marian and Simeon's path-based pruning
degenerates (no further pruning is performed) also for the second query, since the step

descendant-or-self::node

ends up in the set of pruner paths, thus selecting all nodes. This is because their approach
cannot manage predicates. In our approach, instead, predicates are taken into account and
therefore only nodes satisfying C(y) are kept by the projector, thus yielding a very precise
pruning.
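
The rewriting itself can be sketched on a toy representation of queries. Everything in the
sketch is hypothetical: queries are nested tuples, conditions and paths are plain strings,
variables are written $y, and the side condition on external functions is not checked. It is
meant only to show the shape of the transformation, not how our prototype implements it.

```python
import re

DOS = "descendant-or-self::node"


def rewrite_guarded_for(query):
    """Turn  for y in S/descendant-or-self::node return if C(y) then q else ()
    into     for y in S/descendant-or-self::node[C(self::node)] return q.

    Toy encoding: ("for", var, source, body) and ("if", cond, then, orelse),
    with ("empty",) standing for the empty sequence ().
    """
    if not (isinstance(query, tuple) and query and query[0] == "for"):
        return query
    _, var, source, body = query
    if not (isinstance(body, tuple) and body and body[0] == "if"):
        return query
    _, cond, then, orelse = body
    # The condition may only mention the loop variable.
    variables = set(re.findall(r"\$(\w+)", cond))
    if (not source.endswith("/" + DOS)
            or orelse != ("empty",)
            or not variables <= {var}):
        return query
    # Replace every occurrence of $var by self::node to form the predicate.
    predicate = re.sub(r"\$" + re.escape(var) + r"\b", "self::node", cond)
    return ("for", var, f"{source}[{predicate}]", then)
```

Queries that do not match the pattern, for instance because the condition mentions a second
variable, are returned unchanged.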

It is important to stress that, despite their specific form, queries of the first kind are very
common in practice, since they are generated by the XQuery to XQuery-Core compilation of
a non-negligible class of queries, or when rewriting upward axes into downward ones. This
latter observation shows that applying the rewriting rules of [31] to extend Marian
and Simeon's approach to upward axes is not feasible, since the rewriting may completely
compromise pruning.

8 Extension to other typing policies 
8.1 Handling un-typed documents 

Although the use of schemas is becoming more and more widespread, it is still interesting
to see how to perform type-based projection in an untyped world. A first, rather blunt,
approach is to consider a fixed corpus of untyped documents. For such sets of documents
it is possible to infer a DTD. For instance, Bex et al. propose several automata-based
methods to infer DTDs [10] and even XML Schemas ([11, 9]). Once a schema is inferred,
our technique can be applied as is.

More interestingly, this untyped problem can be reduced to a precise typing problem.
Indeed, an untyped document is nothing but a document of type ({X},{X -> Any}). If
we apply the type inference algorithm of Section 5.1 to such an input type, then the result
is ({X},{X -> Any}) itself (meaning that the nodes selected by the query have type
Any). Therefore in this case, since none of the intermediate steps of the query results in
an empty type, the type-projector inference algorithm of Section 5.2 cannot remove any
rule from the input type, which remains ({X},{X -> Any}): the input document cannot be
pruned. However, even though the input type does not contain any meaningful information,
the query itself might. Consider the query "//a/b". It is easy to deduce, by a simple
examination of the query, a projector that keeps only "b" nodes occurring below "a" nodes.
While the solution in this case is straightforward, solving this problem in general is a tricky
issue. The solution for a forward fragment of XPath can be found in the last author's PhD
thesis (see Chapter 7 of [30]). Let us briefly outline it on our example. The first issue is the
representation of types. For such precise algorithms, regular tree grammars are not well
suited. Indeed, instead of the type ({X},{X -> Any}), it is more desirable to have a type
({X},E) where E is the set of rules:

{X -> String, X -> _[X*]}

where _ denotes the set of all possible tree labels. Then the result of the query //a/b applied
to a tree of the above type (i.e., any XML tree) would be the type projector:

{X -> _[X+], X -> a[B+], B -> b[Y*], Y -> String, Y -> _[Y*]}

Note that this type projector is non-deterministic top-down. It matches (and therefore
keeps) any subtree t if and only if there is a subtree t' of t with tag "a" which itself has
a non-empty sequence of children tagged "b". Nodes that are children of an "a" node but
whose tag is not "b" do not have any interpretation and are therefore discarded.
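
The effect of this projector on a concrete document can be sketched as follows. The sketch
hard-codes the query //a/b rather than interpreting the grammar rules, works on an in-memory
tree, and drops the text of intermediate nodes; the function name prune_a_b and these
simplifications are assumptions made for illustration, not the algorithm of [30].

```python
import copy
import xml.etree.ElementTree as ET


def prune_a_b(node):
    """Return a pruned copy of node for the query //a/b, or None when
    no "a" element with a "b" child occurs at or below node."""
    kept = []
    for child in node:
        if node.tag == "a" and child.tag == "b":
            # A "b" child of an "a" node is kept with its full content.
            b = copy.deepcopy(child)
            b.tail = None
            kept.append(b)
        else:
            # Any other child survives only if it leads to some a/b.
            sub = prune_a_b(child)
            if sub is not None:
                kept.append(sub)
    if not kept:
        return None
    out = ET.Element(node.tag, dict(node.attrib))
    out.extend(kept)
    return out
```

For instance, on the document `<r><a><b>x</b><c/></a><d><e/></d><a><c/></a></r>` the function
keeps the root, the first "a" element and its "b" child, and discards everything else.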

In [30], in order to achieve such a precise typing, the inference algorithm makes heavy
use of CDuce's type algebra (see [22]), in particular of intersection and negation types. Also
note that the projector above is not obtained by erasing some of the rules of the original
type but by means of set-theoretic operations. In fact, three new rules were created and,
intuitively, they were obtained by intersecting the initial "X -> _[X*]" rules with a type
"a[B+]", which is the constraint represented by the XPath query. Note also that, contrary
to our approach, where new rules are only erasures of existing ones (of which there
exists only a finite number), special care must be taken not to introduce infinitely many refined
rules or, said differently, in this context even guaranteeing the termination of the algorithm
is a very delicate issue.

8.2 Using regular tree languages as schemas 

While our formal development remains in the very general case of regular tree grammars, 
our implementation only focuses on DTDs. The main reason is that for DTDs pruning is 
efficient memory-wise. For regular tree languages instead, validation (and pruning) may 
need to visit the whole tree before deciding which node to prune. At first, it seems that 
this completely defeats the purpose of pruning, but we argue that pruning can still be of 
practical use in these cases. 

Indeed, a way of addressing this problem is to temporarily store the document in
memory in the form of a succinct tree data structure (based for instance on balanced
parentheses: a survey of the most popular succinct tree representations can be found in [2]). The
final data structure (e.g., a DOM representation) of the document can then be built from
the temporary one, by replaying a sequence of SAX events while traversing the temporary
data structure and by not synthesizing events for pruned sub-trees. An alternative solution
is to store on disk the sequence of SAX events and process it backward, thus simulating a
bottom-up evaluation (validation of regular tree grammars and therefore projection can be
done in a deterministic bottom-up fashion). Such a technique was used in [26] to efficiently
evaluate node-selecting queries bottom-up on documents up to 1 GB in size.

9 Experiments 
9.1 Prototype 

To gauge the benefits of type-based projection, we have implemented our pruning algorithm
in a prototype. Our prototype takes as input an XQuery query, a DTD, and a document.
It then performs the path extraction described in Section 7 and computes for each extracted
path its XPath^ℓ approximation, applying the rewriting rules given in Section 6. Based on
this set of paths, our program performs the static analysis described in Section 5 and
computes a type projector. Once this is done, the prototype parses the input document, prunes
it according to the inferred type projector and serializes the result in a new document.

Besides what is included in the formal description of our algorithm, our prototype
also supports the full set of XPath axes as well as attributes. If we call D the
input document and S the input DTD, then, assuming that D is well-formed, the pruning
process is performed in O(|D|) time and O(|S|) memory, where |D| and |S| denote the sizes
of D and S respectively. Indeed, the type projector associated with a DTD is at most as
big as the original schema (when no pruning is performed) and O(|S|) space is required to
store it in memory. Our prototype can also perform well-formedness checking and validation
while pruning, in which case time complexity remains O(|D|) and memory complexity
becomes O(|S| + log(|D|)) (it is well known that checking well-formedness during validation
requires keeping a stack whose size is at most the height of the document, see e.g. [33]).
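
The single-pass, constant-state nature of pruning can be illustrated by the following sketch.
It is written in Python with the standard xml.sax module rather than in the language of our
prototype, and it makes a simplifying assumption: the type projector is given as a plain set
of element names to keep, which is a hypothetical encoding chosen for the example. The class
name Pruner and the function prune_stream are likewise illustrative.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class Pruner(xml.sax.ContentHandler):
    """Copy SAX events to the output, skipping every subtree whose root
    element name is not in the projector (assumed to be a set of names)."""

    def __init__(self, projector, out):
        super().__init__()
        self.projector = projector
        self.gen = XMLGenerator(out, encoding="utf-8")
        self.skip_depth = 0  # > 0 while inside a pruned subtree

    def startElement(self, name, attrs):
        if self.skip_depth or name not in self.projector:
            self.skip_depth += 1
        else:
            self.gen.startElement(name, attrs)

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.gen.endElement(name)

    def characters(self, content):
        if not self.skip_depth:
            self.gen.characters(content)


def prune_stream(xml_bytes, projector):
    """Prune a document given as bytes and return the result as a string."""
    out = io.StringIO()
    xml.sax.parseString(xml_bytes, Pruner(projector, out))
    return out.getvalue()
```

Apart from the projector itself, the only state kept by the handler is one integer counter,
and each input event is examined exactly once.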

Our prototype is implemented in OCaml, using the PXP library for XML and DTD
parsing.

9.2 Benchmark suite 

We used the XPathMark ([20, 21]) and XMark ([32]) benchmark suites. The former
consists of a large set of XPath queries while the latter provides XQuery queries to test against.

9.2.1 Data-set 

Both XPathMark and XMark use the XMark document generator. These documents
comply with the "auction" DTD representing an auction web site. It defines 77 element types
and 15 attributes. This size and complexity are comparable to "real-life" type definitions
(for instance the XHTML transitional DTD also features 77 element definitions). Because
the "auction" DTD falls outside the conditions of our completeness theorem (it features
recursion and unguarded union), it is a very good test case to illustrate the precision we
achieve in practice even when completeness does not hold. The scalability of our approach
was tested by using documents of varying size, ranging from 12 MB to 3 GB. An important
aspect of the XMark generator is that the proportion of textual data versus tree structure
stays the same for all document sizes. We report here some statistics of interest, which
we use later on to gauge the precision of our pruning algorithm. An XMark generated file
consists of:

• 74% of text content (as PCDATA element or attribute value) 

• 65% of all the text content (that is, 49% of the total file size) resides in a description
element or one of its descendants.

9.2.2 XPathMark queries 

Since its original publication ([21]), the XPathMark benchmark suite has evolved to provide
a very complete set of XPath queries. It is composed of functional test queries, aiming at
ensuring the correctness of an XPath implementation, and performance test queries, which
provide computationally difficult queries. We highlight some of the main design goals of
this test suite (the complete rationale can be found in [20]):

1. queries simulate realistic query needs of a potential user of the auction site;

2. queries are divided into groups according to the intrinsic computational complexity 
of the corresponding evaluation problem. XPath language can be stratified in a num- 
ber of fragments for which different complexity bounds are known BJ. Comparing 
the theoretical computational complexity of the query evaluation problem with the 
actual amount of resources consumed during query evaluation might be, at least, a 
stimulating and instructive exercise; 
3. queries are defined to challenge data scalability of the XML processing system, that
is the performance of the system as the data complexity (document size) grows. In 
particular, the queries talk about document sections (like open and closed auctions, 
items, people, descriptions) that become bigger when the XMark document scaling 
factor increases. Moreover, the results of the queries are small compared to the size 
of the target document. This avoids that the time taken to serialize the results (that 
may be relevant) obfuscates the pure query processing time. 

These three points are exactly those we aim to address with the present work. Indeed,
our approach drastically increases the data scalability (3.) of XML processing systems for
realistic queries (1.) and potentially complex ones (2.). XPathMark queries are divided into
5 groups, labelled from A to E, which we now briefly describe.

Group A contains unary tree pattern queries. These queries use only child and descendant 
axes, node tests equal to * or to a tag name, and filters (predicates). Conjunctive and 
disjunctive Boolean operators are allowed, but negation is not. Relational and arithmetic 
operators and functions are disallowed. These queries fall therefore in the category of 
queries that we handle without any approximation. 

Group B contains the so-called core or navigational XPath queries. This fragment ex- 
tends XPath-A by admitting all XPath axes and negation. It mostly corresponds to queries 
for which our algorithm introduces a very lightweight approximation (we only need to 
approximate negations and those axes we did not treat formally such as preceding). 

Group C extends Group B with relational operators (=, !=, <, >, >=, <=) and the id() 
function. 

Group D extends Group C by allowing all arithmetic operators (+, -, *, div, mod) and
the functions sum() and count().

Group E contains all XPath 1.0 queries. In particular, it extends Group D by allowing
all functions (like position() and contains()).

XPathMark also provides a sixth group, which uses non-standard features of XPath,
such as the transitive closure of a path expression. We excluded this group from our
tests, since neither our implementation nor the query engines that we used supported these
extensions.

9.2.3 XMark test suite 

To validate the extension of our approach to XQuery, and in particular the path extraction
algorithm, we use queries from the XMark benchmark suite ([32]). These queries feature
"for" expressions guarded by "where" conditions and make use of element constructors to
format their results. The corresponding code for the queries under consideration is given in
Appendix B.

9.3 Protocol 

We have designed two experiments, based on two different XQuery engines, to validate
our approach. For each engine and each query described in the previous section we
applied the following protocol. First, we tested the engine against original documents of
increasing size and stopped when the query engine could not handle the input document
anymore. Then we repeated the experiment a second time but used documents pruned by
our prototype as input for the query engine. We now detail our experimental settings for
the two engines we considered: Saxon-b/XQuery and MonetDB/XQuery.

9.3.1 Test machine



The experiments were performed on a desktop PC, with an Intel Core 2 Xeon 3 GHz CPU,
3.5 GB of RAM and an S-ATA hard drive. We used Ubuntu Linux 9.10 64-bit (featuring a
2.6.31 kernel) as operating system. The file system used was ext3, with default settings.
The OS was allocated 6 GB of swap space and tests were done in a reduced environment,
where only essential services were running concurrently with our experiments. In what
follows, when timings are reported they are obtained by removing the best and worst timings
of 5 runs and averaging the remaining three. Also, all the parameters we measured (running
time, memory consumption, I/O operations, ...) were measured in independent runs in
order to have as little impact as possible on the experiments. The memory consumption of
a running program was measured by monitoring the so-called "resident set size" field of the
/proc/pid/statm pseudo-file (this field summarizes the amount of private data mapped
in physical memory by the process, excluding shared segments such as shared libraries or
shared mmaped files). I/O operations were monitored using the iotop utility.
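
For clarity, the way a reported timing is derived from five runs can be written as follows.
The function name reported_timing is an assumption made for this illustration; it is not
taken from our measurement scripts.

```python
def reported_timing(runs):
    """Drop the best and the worst of five timings and average the rest."""
    if len(runs) != 5:
        raise ValueError("the protocol uses exactly 5 runs")
    kept = sorted(runs)[1:-1]  # the three middle values
    return sum(kept) / len(kept)
```

Discarding the two extreme values makes the reported figure less sensitive to a single
unusually fast or slow run.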

9.3.2 Saxon-b/XQuery 

Saxon [25] is a popular XML library which implements various W3C standards (XPath,
XQuery and XSLT) and has full schema support. We used version 9.0 of the Saxon-b
XQuery engine (which is the Open Source one). Saxon being a main-memory query engine,
we focused on the following measurements, both for pruned and unpruned documents: query
answering time (excluding the parsing time of the document and serialization of the results,
as reported by Saxon's debugging flags) and memory consumption. Saxon being written in
Java, we used the latest available version of Sun's JVM (1.6.0, 64-bit version) and set
the amount of memory available to the JVM to the total physical memory of the machine.

9.3.3 MonetDB/XQuery 

MonetDB/XQuery [12] is a well-established native XML database with full XQuery
support. Contrary to Saxon, MonetDB stores on disk an index allowing fast navigation and
query answering. In particular, since it uses the disk as secondary storage, MonetDB is not
limited by the amount of physical memory (it uses as much memory as possible to answer
a query efficiently and performs its own page management by mapping memory pages to
the disk and reading them back when needed). Therefore, for such a query engine, speed is
directly related to memory: the more memory is available, the less swapping occurs
between pages on disk and pages in main memory. The three parameters we measured for
MonetDB were the query answering time (again we did not consider document parsing or
serialization time), the size of the generated index on disk for a given document, and the
amount of I/O performed to answer the query. Indeed, since MonetDB tries to max out
its memory use to favour query answering time, measuring memory consumption does not
reflect the actual scalability improvement one could expect when pruning documents. Disk
accesses, on the contrary, are the bottleneck for such an engine and their frequency directly
impacts query answering time.

9.4 Experimental results 
9.4.1 Pruning precision 

We gauged the precision of our pruning algorithm for the full set of XPathMark and XMark
queries by comparing the size of the pruned document (serialized on disk) with the size of
the corresponding original document. We report in Figure 6 the pruning ratio in percent of
the original file size for XPathMark (labelled A1 to E8) and XMark (labelled M1 to M20)
queries.

9.4.2 Saxon-b/XQuery



In our testing environment the biggest unpruned document that the Saxon engine could
handle was 671 MB. We report in Figure 7 the original size of the largest document
Saxon could handle once pruned, and the size of its projection (both in MB). Also, for a
document of size 671 MB, we report the running time and memory consumption for the
original and pruned versions (as well as the size of the pruned document). Lastly, we report the
speed-up factor obtained thanks to pruning and the memory improvement we achieved (in
percent of the original memory consumption) for the projected document. Due to lack
of space, we do not detail all of the XPathMark and XMark queries but rather, for each
category, we give the "best performing query" (that is, the one for which we could achieve the
biggest speed-up) and the "worst performing query", the one for which the speed-up was
the smallest.

9.4.3 MonetDB/XQuery 

Since MonetDB makes use of secondary storage (disk) to query arbitrarily large
documents, we chose a different approach to validate our pruning algorithm. We fixed the size
of the input document to 3363 MB and then indexed it into the MonetDB document
repository, yielding an index (on disk) of 4644 MB (as reported by the MonetDB administrative
interface). Then, for each query, we pruned the 3363 MB document with respect to the
input query and indexed it. We summarize the results in Figure 8. The first line in the table
reports the size in MB of the index corresponding to the pruned document. The second line
reflects the ratio between the amount of I/O performed by the MonetDB server
for the pruned file and the amount of I/O performed on the original file. We only take into
account the amount of data read from disk, which helps us gauge the amount of data fetched
from the index on disk into main memory. In this same figure, the graphics represent the
absolute query answering time in seconds for the original and pruned documents.

Finally, the third line gives the speed-up in query answering achieved through pruning.
We were not able to run query E5 on our version of MonetDB (the server segfaulted at
some point during the query computation).

9.5 Interpretation 
9.5.1 Pruning precision 

The results in Figure 6 show that, for the vast majority of the queries we considered,
the document can be pruned to less than 10% of its original size. More precisely, of the 58
queries we considered (20 XMark queries and 38 XPathMark queries):

• 47 queries yielded a projected document whose size was less than 5% of the original 

• 5 queries (M10, B3, B4, D2, E4) had a pruning ratio between 5% and 10% of the 
original file size 

• 2 queries (B2, E3) had a pruning ratio of 17.035% 

• 4 queries (M14, E6, E7, E8) had a pruning ratio of 27.35% 

It should be noted that queries such as M14 return the content of a description element,
consisting of almost all the textual data contained in the original XMark document. Since
in these queries the value of the whole element is needed at runtime to perform string
searching operations, there is little that can be done from the point of view of static pruning.

9.5.2 Saxon-b/XQuery



As we can see from the results in Figure 7, pruning the document before querying it always
yields a speed-up and a reduction in memory use for a main-memory engine such as Saxon.

On the one hand, for queries whose main bottleneck is string operations (such as calls
to the contains function in M14 and E7), document projection gives very little speed-up.
On the other hand, queries A1 and C3 see a dramatic speed-up (20 and 17 times faster than
the original respectively). This shows that despite the various optimizations built into the
engine, a significant amount of time is often spent iterating over "non-relevant" nodes,
which are discarded by the pruning process.

On the side of memory consumption, pruning the document unsurprisingly reduces
memory usage drastically. Indeed, document projection reduces the number of elements
and therefore simplifies the tree structure of the XML document. This aspect is critical
for main-memory engines, which often (as in the case of Saxon) represent the document
as a pointer-based data structure (e.g., following the DOM model [17], where each element
is represented as a node which contains a pointer to its first child, to its next sibling, and
to its parent). Indeed, we observed that for Saxon (and we observed similar behaviour in
other main-memory query engines) a 112 MB XMark document would occupy 430 MB of
RAM, while the same document stripped of its data (amounting to only 36 MB on disk)
would occupy 340 MB of memory. As Figure 7 illustrates, our pruning technique precisely
addresses this issue, reducing in most cases the memory consumption to a few percent of
what is needed to handle an unpruned file.

9.5.3 MonetDB/XQuery 

MonetDB is known to be one of the fastest XML databases available. The efficiency of the
MonetDB/XQuery engine is essentially due to the staircase join operation ([24]), which
minimizes the number of intermediate sets constructed to answer an XPath query. Even
so, the use of type-based document projection often improves query answering time. In
particular, as shown in Figure 8, a smaller index often yields fewer I/O operations, which in
turn increases the speed of the query engine. On the contrary, for queries such as D2, M14
and M15, the document is already optimally indexed and reducing the size of the index
does not reduce the amount of I/O, which explains why for these queries the gain in speed
is null. Yet for some queries the speed-up can be up to twenty-fold (D1).

9.5.4 Comparison with related work 

These results are a clear-cut improvement over current technology. While we cannot
directly compare processing performance, since no implementation of the other pruning
approaches is publicly available, we want to stress two points. First, for XMark queries the
pruning precision we achieve is equal to or better than what is obtained with other approaches
(with the exception of query M10, for which [27] achieves a pruning ratio of 4.5% where
we could only prune the document down to 9.2% of its original size). Second, performing
pruning is never a bottleneck in our case, thanks to the fact that our solution consists of
a single buffer-less traversal of the input document (on our test machine we were able to
efficiently prune arbitrarily large documents, while in the case of [27] pruning can end up using
as much memory as the execution of the query).

The experiments also illustrate that our approach retains a very high precision even in
the presence of complex XPath features (like backward axes and external functions). While
it is true that the technique of [31] could be used to allow Marian and Simeon's work to
handle backward axes, it would still not be, in our view, a satisfactory solution. The first
reason is that the rewrite rules given in [31] do not support the use of data values or negation
in the filters of the original query (see [3]). For instance the query

descendant::keyword[not(ancestor::item)]

cannot be written without a backward axis. Second, the generated query may be exponentially
bigger than the original one (and its computation takes exponential time in the size of
the original) and may introduce several predicates as well as descendant-or-self axes.
Both features degrade the pruning precision of [27].

10 Conclusion and future work 

Our experiments show the clear advantages of applying our optimisation technique to query 
XML documents, and the characteristics of our solution make it profitable in all application 
scenarios. We discussed several aspects for which our approach improves the state of the 
art: for performances (better pruning, more speedup, less memory consumption), for the 
analysis techniques (linear pruning time, negligible memory and time consumption), for 
its generality (handling of all axes and of predicates), and, last but not least, for the formal 
foundation it provides (correctness formally proved, limits of the approach formally stated). 

The present work extends and improves the preliminary version presented at VLDB
2006 in several aspects. From a formal point of view, the use of regular tree grammars
as schema model makes the technique applicable to the various kinds of schemas currently
in use. Furthermore, the closure properties that we proved ensure that type-based projection
is at most as expensive as validation for a given class of schema language. We also handle a
richer set of queries formally (in particular we handle nested predicates in XPath^ℓ) and took
special care to document how to encode or approximate several important XPath idioms
that were lacking from the formal presentation. On a practical level, we have validated
our approach against state-of-the-art query engines, using realistic queries and data sets. In
particular, not only did we test against an efficient main-memory query engine (Saxon) but
we also demonstrated that our approach can be used to improve, sometimes by a double-digit
factor, the performance of an already very optimized disk-based XML database such as
MonetDB/XQuery.

Future work will be pursued both at a formal and a practical level. At a formal level,
one of the main shortcomings of our approach is its reliance on XPath syntax. Indeed, even
though we managed to isolate a fragment of XPath that we could formally reason with, it
still leaves us with a syntax-directed approach. The problem with this is twofold. First, it
makes the proofs and the specification of algorithms quite tedious and unnecessarily
intricate. Second and more importantly, our pruning inference algorithm might yield different
type projectors depending on the syntax of the original query. For future work, we would
like to tackle a semantics-based approach. In particular it seems worthwhile to consider
more theoretically sound formalisms for tree queries such as, for instance, MSO formulae
or tree automata. The latter in particular would allow us to reuse our pruning algorithm for
pattern-matching based languages (such as the CDuce language (Q and its pattern-based 
query language CQL [28, 6, 14]). It is also known that tree automata (as well as MSO
formulae) have better closure properties than XPath expressions and support fine-grained
set-theoretic operations (intersection, union, complement) that have been used with
success to devise very precise type systems for XML [22].

At a practical level we would like to see a tighter integration between document
projection and query engines. Firstly, although quite crude, our experiments show that
even a carefully designed indexed system such as MonetDB can benefit from document
pruning. It seems interesting to develop these preliminary results further and to design a
projection-aware XML index. In other words, we would like to be able to equip any native
XML query engine optimizer with a type-projector component. In particular, one could
think of an index consisting of the original document together with its projected versions.
Textual data could be shared between the main document and the projected ones, which
would merely become projected views of the tree structure of the document. We hypothesize
that the overhead of such pruned tree structures would be quite small compared
to the size of an XML index, while providing a significant speed-up in query answering time.

Secondly, coupling our projection algorithm with early query answering techniques
would allow us to achieve further pruning, especially when runtime conditions are involved.
For instance, we could use our type-inference algorithm to determine the type of the
elements to which a given built-in function is applied, as in an expression such as

contains(.//*, "foo").

This information could then be used at loading time to discard elements that do not match
the predicate.

A Detailed proofs



LEMMA (3.3) Let $\pi$ be a type projector for $(\mathcal{S},E)$. Then for every tree $t \in_{\mathfrak{I}} (\mathcal{S},E)$ it
holds that $(t \setminus_{\mathfrak{I}} \pi) \preceq t$.

Proof 1 The proof is a straightforward induction on $t$.



Lemma (Erasure preserves locality (3.4)) Let $(\mathcal{S},E)$ be a local tree grammar
and $(\mathcal{S}',E')$ a regular tree grammar. If $(\mathcal{S}',E') <: (\mathcal{S},E)$ then $(\mathcal{S}',E')$ is a local
tree grammar.

Proof 2 By contradiction, suppose that $(\mathcal{S}',E')$ is not a local tree grammar. By the
definition of erasure, $\mathcal{S}' \subseteq \mathcal{S}$, therefore $|\mathcal{S}'| \le |\mathcal{S}| \le 1$.

• Then, either there exist two competing rules $A \to l[r'_a]$ and $B \to l[r'_b]$ in $E'$. Then,
by definition of erasure, there exist two rules $A \to l[r_a]$ and $B \to l[r_b]$ in $E$ such
that $r'_a = r_a\backslash_{N_a}$ and $r'_b = r_b\backslash_{N_b}$ for some $N_a \subseteq \mathit{Names}(r_a)$ and $N_b \subseteq \mathit{Names}(r_b)$. But
then these two rules share the same label $l$, and therefore compete with each
other, which contradicts the fact that $(\mathcal{S},E)$ is a local tree grammar.

• Or there exist two rules $C \to l[r'_1]$ and $C \to l'[r'_2]$ in $E'$ with the same left-hand
side (and distinct labels). But then, by definition of erasure, there exist two
corresponding rules in $E$, $C \to l[r_1]$ and $C \to l'[r_2]$, such that $r'_i = r_i\backslash_{N_i}$, $i \in \{1,2\}$, for
some names $N_i$. Therefore there are two rules in $E$ with the same left-hand side,
which contradicts the fact that $(\mathcal{S},E)$ is a local tree grammar.

□



Lemma (Erasure preserves single-typedness (3.5)) Let $(\mathcal{S},E)$ be a single-type
tree grammar and $(\mathcal{S}',E')$ a regular tree grammar. If $(\mathcal{S}',E') <: (\mathcal{S},E)$ then
$(\mathcal{S}',E')$ is a single-type tree grammar.

Proof 3 By contradiction, suppose that $(\mathcal{S}',E')$ is not a single-type tree grammar and
proceed by case analysis:

• either there exist two competing non-terminals $A$ and $B$ in $\mathcal{S}'$. But by definition of
erasure, $\mathcal{S}' \subseteq \mathcal{S}$, so $(\mathcal{S},E)$ has two competing start symbols, which contradicts
the hypothesis that $(\mathcal{S},E)$ enjoys the single-type property.

• or there exists a rule $X \to l[r']$ and there exist two competing non-terminals $A$ and
$B$ in $\mathit{Names}(r')$. Since $(\mathcal{S}',E') <: (\mathcal{S},E)$, there exists $X \to l[r]$ such that
$r' = r\backslash_N$ for some $N \subseteq \mathit{Names}(r)$. But that means that $A$ and $B$ are in $\mathit{Names}(r)$,
which implies that $(\mathcal{S},E)$ is not a single-type tree grammar, thus contradicting
our hypothesis. □

Lemma (Union closure of local type projectors (3.6)) Let $(\mathcal{S},E)$ be a
local tree grammar. Let $(\mathcal{S}_1,E_1)$ and $(\mathcal{S}_2,E_2)$ be two tree grammars such that $(\mathcal{S}_1,E_1) <: (\mathcal{S},E)$
and $(\mathcal{S}_2,E_2) <: (\mathcal{S},E)$. Then $(\mathcal{S}_1 \cup \mathcal{S}_2, E_1 \cup E_2)$ is a local tree grammar.

Proof 4 Consider $(\mathcal{S}_1 \cup \mathcal{S}_2, E_1 \cup E_2)$ and suppose, by contradiction, that it is not local.
First, remark that by definition of erasure, $\mathcal{S}_1 \subseteq \mathcal{S}$ and $\mathcal{S}_2 \subseteq \mathcal{S}$, therefore
$\mathcal{S}_1 \cup \mathcal{S}_2 \subseteq \mathcal{S}$ and consequently $|\mathcal{S}_1 \cup \mathcal{S}_2| \le |\mathcal{S}| \le 1$. Second:

• either we have two rules $A \to l[r_a]$ and $B \to l[r_b]$ (with $A$ and $B$ distinct). By
Lemma 3.4 we know that $(\mathcal{S}_1,E_1)$ and $(\mathcal{S}_2,E_2)$ are local tree grammars. Then it
must be that one of the two rules at issue is in $(\mathcal{S}_1,E_1)$ and the other in $(\mathcal{S}_2,E_2)$,
otherwise one of the two grammars would not be local (it would have the competing
pair in its rules). Without loss of generality we can suppose that $A \to l[r_a] \in E_1$
and $B \to l[r_b] \in E_2$. Since $(\mathcal{S}_1,E_1) <: (\mathcal{S},E)$, by definition of erasure there
exists a rule $A \to l[r'_a] \in E$ with $r_a = r'_a\backslash_N$ for some $N \subseteq \mathit{Names}(r'_a)$. Similarly,
there exists $B \to l[r'_b] \in E$ with $r_b = r'_b\backslash_{N'}$ for some $N' \subseteq \mathit{Names}(r'_b)$. But then, this
means that we have two competing rules in $E$, which contradicts the hypothesis
that $(\mathcal{S},E)$ is a local tree grammar.

• or there are two rules $C \to l[r_1]$ and $C \to l'[r_2]$ with the same left-hand side and
distinct labels in $E_1 \cup E_2$. Similarly to the previous case, we must have $C \to l[r_1] \in E_1$
and $C \to l'[r_2] \in E_2$, otherwise $(\mathcal{S}_1,E_1)$ or $(\mathcal{S}_2,E_2)$ would not be local. But
then, by definition of erasure, there exist two rules $C \to l[r'_1] \in E$
and $C \to l'[r'_2] \in E$ such that $r_i = r'_i\backslash_{N_i}$, $i \in \{1,2\}$, for some names $N_i$. This means
that there are two rules in $E$ with distinct labels and the same left-hand side, which
contradicts the assumption that $(\mathcal{S},E)$ is a local tree grammar.

□



Lemma (Union closure of single-type type projectors (3.7))
Let $(\mathcal{S},E)$ be a single-type tree grammar. Let $(\mathcal{S}_1,E_1)$ and $(\mathcal{S}_2,E_2)$ be two tree
grammars such that $(\mathcal{S}_1,E_1) <: (\mathcal{S},E)$ and $(\mathcal{S}_2,E_2) <: (\mathcal{S},E)$. Then $(\mathcal{S}_1 \cup \mathcal{S}_2, E_1 \cup E_2)$ is a
single-type tree grammar.

Proof 5 Consider $(\mathcal{S}_1 \cup \mathcal{S}_2, E_1 \cup E_2)$ and suppose by contradiction that it does not
enjoy the single-type property. This implies that there exists a rule $X \to l[r]$ such that
$\mathit{Names}(r)$ contains two competing non-terminals $A$ and $B$. Moreover, by Lemma 3.5
we know that $(\mathcal{S}_1,E_1)$ and $(\mathcal{S}_2,E_2)$ are single-type tree grammars. Therefore $X \to l[r]$
can be neither in $E_1$ nor in $E_2$, because they have the single-type property. The only possibility
is that there exists a rule $X \to l[r_1]$ in $E_1$ with $A \in \mathit{Names}(r_1)$ and a rule $X \to l[r_2]$ in $E_2$ with
$B \in \mathit{Names}(r_2)$, and that $r = r_1 \mid r_2$ (remember that we identify two rules with the same
left-hand side and the same label by merging them into a single rule). Therefore, by
the definition of erasure, there exists a rule $X \to l[r'_1]$ in $E$ such that $A \in \mathit{Names}(r'_1)$.
Similarly, there exists a rule $X \to l[r'_2]$ in $E$ such that $B \in \mathit{Names}(r'_2)$. Since we identify
such rules, there is a rule $X \to l[r'_1 \mid r'_2]$ in $E$. But then this rule contains both $A$ and $B$,
which are competing. This contradicts the hypothesis that $(\mathcal{S},E)$ is a single-type tree
grammar. □
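
The four closure lemmas above can be exercised on a toy grammar. The following Python sketch is only an illustration under simplifying assumptions: a rule $X \to l[r]$ is abstracted to the triple (X, l, Names(r)), so the regular expression itself is not modelled, and erasure is assumed to keep the rules of a chosen set of names while removing every other name from their content models. The names `rule`, `competing`, `is_local`, `is_single_type`, `erase` and `union` are hypothetical helpers, not part of the formal development.

```python
from itertools import combinations

def rule(name, label, names=()):
    """A rule X -> l[r], abstracted as (X, l, Names(r))."""
    return (name, label, frozenset(names))

def labels_of(x, rules):
    return {l for (n, l, _) in rules if n == x}

def competing(a, b, rules):
    """Two distinct non-terminals compete when some of their rules share a label."""
    return a != b and bool(labels_of(a, rules) & labels_of(b, rules))

def is_single_type(starts, rules):
    # No competing pair among the start symbols or inside one content model.
    groups = [frozenset(starts)] + [names for (_, _, names) in rules]
    return not any(competing(a, b, rules)
                   for g in groups for a, b in combinations(sorted(g), 2))

def is_local(starts, rules):
    # At most one start symbol, one label per left-hand side, no competing pair at all.
    lhs = sorted({n for (n, _, _) in rules})
    return (len(starts) <= 1
            and all(len(labels_of(x, rules)) == 1 for x in lhs)
            and not any(competing(a, b, rules) for a, b in combinations(lhs, 2)))

def erase(starts, rules, keep):
    """Assumed model of erasure: keep the rules of `keep`, drop all other names."""
    keep = frozenset(keep)
    return (frozenset(starts) & keep,
            {(n, l, names & keep) for (n, l, names) in rules if n in keep})

def union(g1, g2):
    """Union of two grammars; rules with the same left-hand side and label are merged."""
    (s1, e1), (s2, e2) = g1, g2
    merged = {}
    for (n, l, names) in e1 | e2:
        merged[(n, l)] = merged.get((n, l), frozenset()) | names
    return (s1 | s2, {(n, l, names) for (n, l), names in merged.items()})

STARTS = frozenset({"Bib"})
RULES = {rule("Bib", "bib", {"Book"}), rule("Book", "book", {"Title", "Author"}),
         rule("Title", "title"), rule("Author", "author")}

g1 = erase(STARTS, RULES, {"Bib", "Book", "Title"})   # projector keeping titles
g2 = erase(STARTS, RULES, {"Bib", "Book", "Author"})  # projector keeping authors
g = union(g1, g2)
```

On this example both erasures and their union stay local (hence single-type), and merging the two `Book` rules rebuilds the original content model, which mirrors the rule identification used in Proof 5.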



LEMMA (5.2) Let $t$ be a tree $\mathfrak{I}$-valid with respect to the schema $(\mathcal{S},E)$. For every
$S \subseteq \mathit{Ids}(t)$ and type $\tau$, if $\mathfrak{I}(S) \subseteq \mathit{Dn}(\tau)$, then

1. $\mathfrak{I}([\![\mathit{Axis}]\!]_t(S)) \subseteq \mathit{Dn}(A_E(\tau,\mathit{Axis}))$

2. $\mathfrak{I}(S ::_t \mathit{Test}) \subseteq \mathit{Dn}(T_E(\tau,\mathit{Test}))$

Proof 10 The proof is done by case analysis on the possible axes (for (1.)) and tests (for
(2.)).

1. We only need to consider the self and descendant axes. Indeed child is an
instance of descendant, and ancestor and parent are the duals of descendant
and child respectively.

self: By Definition 4.3 we have that $[\![\mathtt{self}]\!]_t(S) = S$, therefore
$\mathfrak{I}([\![\mathtt{self}]\!]_t(S)) = \mathfrak{I}(S)$. By Definition 5.7, $A_E(\tau,\mathtt{self}) = \tau$. Since
$\mathfrak{I}(S) \subseteq \mathit{Dn}(\tau)$ by hypothesis, we can conclude that

$$\mathfrak{I}([\![\mathtt{self}]\!]_t(S)) \subseteq \mathit{Dn}(A_E(\tau,\mathtt{self}))$$

descendant: we suppose $i' \in [\![\mathtt{descendant}]\!]_t(S)$. Let us show that
$\mathfrak{I}(i') \in \mathit{Dn}(A_E(\tau,\mathtt{descendant}))$. If $i' \in [\![\mathtt{descendant}]\!]_t(S)$ then by Definition 4.3

$$\exists i \in S \text{ such that } (i,i') \in \mathit{Edg}(t)^+$$

By definition of $\mathit{Edg}$ this means that there exists a sequence $i_0, i_1, \ldots, i_n$ such
that $t@i_k = l[\ldots t' \ldots]$ and $\mathit{RootId}(t') = i_{k+1}$, with $i_0 = i$ and $i_n = i'$. Said
differently, $i_0, i_1, \ldots, i_n$ is the path from $i$ to its descendant $i'$ in the tree $t$ that
we consider.

Let us now call $X_k = \mathfrak{I}(i_k)$. For all $k$, there is a rule $X_k \to R_k \in E$ such that
$X_{k+1} \in \mathit{Names}(R_k)$ (since $t$ is $\mathfrak{I}$-valid with respect to $E$ and by Definition 2.8).
Hence, there exists a chain $X_0 \Rightarrow_E \cdots \Rightarrow_E X_n$ with $X_0$ in $\mathfrak{I}(S)$. By hypothesis,
$\mathfrak{I}(S) \subseteq \mathit{Dn}(\tau)$, therefore $X_0 \in \mathit{Dn}(\tau)$. Since $X_0 \Rightarrow_E \cdots \Rightarrow_E X_n$, by Definition 4.3
we have $X_n \to R_n \in A_E(\tau,\mathtt{descendant})$. But since $X_n = \mathfrak{I}(i')$, we have

$$\mathfrak{I}(i') \in \mathit{Dn}(A_E(\tau,\mathtt{descendant}))$$

and therefore

$$\mathfrak{I}([\![\mathtt{descendant}]\!]_t(S)) \subseteq \mathit{Dn}(A_E(\tau,\mathtt{descendant}))$$

2. By case on the test:

node: By Definition 4.2 we have:

$$S ::_t \mathtt{node} = S$$

By Definition 5.1 we have $T_E(\tau,\mathtt{node}) = \tau$. Since $\mathfrak{I}(S) \subseteq \mathit{Dn}(\tau)$ by
hypothesis, we have that:

$$\mathfrak{I}(S ::_t \mathtt{node}) \subseteq \mathit{Dn}(T_E(\tau,\mathtt{node}))$$

a (for some element name a): suppose $i \in S ::_t a$. Let us show that
$\mathfrak{I}(i) \in \mathit{Dn}(T_E(\tau,a))$. By Definition 4.2 we know that $t@i = a[f]$ for some forest
$f$. Since $t$ is $\mathfrak{I}$-valid with respect to $E$, then $\mathfrak{I}(i) = Y$ and there exists a rule
$Y \to a[R]$ in $E$. Since $i \in S$, we also have that $\mathfrak{I}(i) \in \mathit{Dn}(\tau)$ (by hypothesis).
By Definition 5.7, since $Y \to a[R] \in \tau$, then $Y \to a[R] \in T_E(\tau,a)$ and
therefore $\mathfrak{I}(i) = Y \in \mathit{Dn}(T_E(\tau,a))$, hence

$$\mathfrak{I}(S ::_t a) \subseteq \mathit{Dn}(T_E(\tau,a))$$

text: similar to the previous case.

□
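
As a sanity check of the lemma, the two inclusions can be tested on a small grammar and one valid tree. The sketch below is an assumption-laden illustration, not the paper's definitions: a type is represented only by its set of names, `A` and `T` implement the behaviour that the proof relies on ($A_E(\tau,\mathtt{self}) = \tau$, descendant as the transitive closure of $\Rightarrow_E$, $T_E(\tau,a)$ keeping the rules labelled $a$), and the grammar `E`, the document `TREE` and all function names are hypothetical.

```python
# Grammar E: name -> (label, names occurring in the content model).
E = {
    "Bib":    ("bib",    {"Book"}),
    "Book":   ("book",   {"Title", "Author"}),
    "Title":  ("title",  set()),
    "Author": ("author", set()),
}

def step(names):
    """Names reachable from `names` in exactly one =>_E step."""
    return {y for x in names for y in E[x][1]}

def A(tau, axis):
    """Type-level counterpart of a downward axis (types are sets of names here)."""
    if axis == "self":
        return set(tau)
    if axis == "child":
        return step(tau)
    if axis == "descendant":
        seen, frontier = set(), step(tau)
        while frontier - seen:          # transitive closure of =>_E
            seen |= frontier
            frontier = step(frontier)
        return seen
    raise ValueError(axis)

def T(tau, test):
    """Type-level counterpart of a node test."""
    return set(tau) if test == "node" else {x for x in tau if E[x][0] == test}

# A valid document: id -> (name assigned by the interpretation, child ids).
TREE = {0: ("Bib", [1, 4]), 1: ("Book", [2, 3]), 2: ("Title", []),
        3: ("Author", []), 4: ("Book", [5]), 5: ("Title", [])}

def descendants(ids):
    out, todo = set(), [c for i in ids for c in TREE[i][1]]
    while todo:
        i = todo.pop()
        if i not in out:
            out.add(i)
            todo.extend(TREE[i][1])
    return out

def interp(ids):
    """The set of names assigned to a set of node ids."""
    return {TREE[i][0] for i in ids}
```

With $S = \{0\}$ and $\tau$ reduced to the name `Bib`, the names of the descendants of the root are contained in `A({"Bib"}, "descendant")`, as inclusion 1 requires.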
Lemma (Termination of type inference (5.5)) Let $(\mathcal{S},E)$ be a type, $P$ a path,
and $\Sigma$ and $\Sigma'$ two environments. If there is a derivation for the judgment $\Sigma \vdash_E P : \Sigma'$, then
this derivation is unique and finite.

Proof 11 Uniqueness of the derivation is immediate, since the rules are syntax-directed:
at each step, at most one of the rules applies (if no rule applies, there is no derivation
and the output type is $\emptyset$). Finiteness can be shown by a simple induction on the length
of the path, noted $l(P)$, that we define as follows:

$$l(\mathit{Axis}::\mathit{Test}) = 1$$

$$l(\mathit{Axis}::\mathit{Test}[C]) = 1 + \sum_{P \in C} l(P)$$

$$l(P/P') = l(P) + l(P')$$

$$l(P|P') = l(P) + l(P')$$

Basic case: The query has length 1, meaning it is a single step without predicate. Then
the only rules we can apply are (down-axis), (up-axis) or (test). These rules have
no premise, therefore the derivation is finite and has length 1.

Inductive case: If the query has several steps, then the rule (sequence) applies. The
lengths of the queries in the premises of the rule are strictly less than the length of
the query in the goal, by definition of $l(\_)$. By induction hypothesis both premises
have a finite derivation, therefore the goal can be derived with a finite derivation.

Similarly, if the query is a top-level union, the typing rule (union) applies.

If the query is a single step with a predicate, then rule (predicate) applies. We
should first remark that there is a finite set of rules in $\Sigma_{typ}$ and a finite number
$n$ of paths in $\mathit{Cond}$. Thus, there are exactly $|\Sigma_{typ}| \times n$ premises for this rule. For
each one of these premises, the path $P_{jk}$ is such that $l(P_{jk}) < l(P)$, by definition
of $l(\_)$. By induction hypothesis, every premise has a finite derivation, therefore the
judgment in the goal of the rule has a finite derivation.

□
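
The length measure $l(P)$ used in this proof is easy to compute. The following Python sketch is an illustration only: the classes `Step`, `Seq` and `Alt` are a hypothetical abstract syntax for paths (a predicate is reduced to the tuple of paths it contains), and `premises` lists the sub-paths on which the (sequence), (union) and (predicate) rules recurse.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Step:                   # Axis::Test or Axis::Test[C]
    axis: str
    test: str
    cond: Tuple = ()          # the paths occurring in the predicate C, if any

@dataclass(frozen=True)
class Seq:                    # P/P'
    left: object
    right: object

@dataclass(frozen=True)
class Alt:                    # P|P'
    left: object
    right: object

def length(p):
    """l(P): one per step, plus the lengths of the paths inside its predicate."""
    if isinstance(p, Step):
        return 1 + sum(length(q) for q in p.cond)
    return length(p.left) + length(p.right)

def premises(p):
    """Sub-paths that the typing rules recurse on."""
    return list(p.cond) if isinstance(p, Step) else [p.left, p.right]

def strictly_decreasing(p):
    """Check that l strictly decreases along every premise, recursively."""
    return all(length(q) < length(p) and strictly_decreasing(q) for q in premises(p))

# descendant::book[child::title/self::node]/child::author
pred = Seq(Step("child", "title"), Step("self", "node"))
query = Seq(Step("descendant", "book", (pred,)), Step("child", "author"))
```

For this query the measure is 4 (two steps plus a predicate path of length 2), and every premise is strictly shorter than its goal, which is the property the induction relies on.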



Theorem (Soundness of type inference (5.6)) Let $(\mathcal{S},E)$ be a type and $P$ a
path. Let $E_0 = \{X \to R \mid X \to R \in E, X \in \mathcal{S}\}$. If $(E_0,E_0) \vdash_E P : (\tau,\kappa)$ then:

$$\mathit{Dn}(\tau) \supseteq \bigcup_{t \in_{\mathfrak{I}} (\mathcal{S},E)} \mathfrak{I}([\![P]\!]_t(\mathit{RootId}(t)))$$

Proof 12 We consider the following, more general judgment:

$$(\tau,\kappa) \vdash_E P : (\tau',\kappa')$$

We show simultaneously the following properties:

1. Soundness: for every tree $t$ $\mathfrak{I}$-valid with respect to $(\mathcal{S},E)$ and every set $S \subseteq \mathit{Ids}(t)$, if
$\mathfrak{I}(S) \subseteq \mathit{Dn}(\tau)$ then:

$$\mathfrak{I}([\![P]\!]_t(S)) \subseteq \mathit{Dn}(\tau')$$

2. Context well-formedness: if

$$\kappa = \{Y \to R \mid \forall Z \in \mathit{Dn}(\tau),\ X \Rightarrow_E^* Y \Rightarrow_E^* Z\}$$

then

$$\kappa' = \{Y \to R \mid \forall Z \in \mathit{Dn}(\tau'),\ X \Rightarrow_E^* Y \Rightarrow_E^* Z\}$$

Property 1 is a generalization of the soundness property we are proving. Given a set of
context nodes $S$ whose types are in $\tau$, the type of $[\![P]\!]_t(S)$ is in $\tau'$. Property 2 states that
the algorithm preserves the well-formedness of contexts. We prove both properties by
induction on the depth of the typing derivation, which is finite by Lemma 5.5.

Base case:

(down-axis): Property 1 is true by a direct application of Lemma 5.2. Property 2
holds by definition of $A_E(\_,\_)$.

(up-axis): By Lemma 5.2,

$$\mathfrak{I}([\![\mathit{Axis}::\mathtt{node}]\!]_t(S)) \subseteq \mathit{Dn}(A_E(\tau,\mathit{Axis}))$$

We must now show that $\mathfrak{I}([\![\mathit{Axis}::\mathtt{node}]\!]_t(S))$ is in $\Sigma_{ctx}$, for it to be in the
intersection of both. Since $\kappa$ is a well-formed context:

$$\kappa = \{Y \to R \mid \forall Z \in \mathit{Dn}(\tau),\ X \Rightarrow_E^* Y \Rightarrow_E^* Z\}$$

Let us first consider the case $\mathit{Axis} = \mathtt{ancestor}$. By Definition 4.3,

$$[\![\mathtt{ancestor}::\mathtt{node}]\!]_t(S) = \{i' \mid i \in S \wedge (i',i) \in \mathit{Edg}^+(t)\}$$

Thus

$$\mathfrak{I}(\{i' \mid i \in S \wedge (i',i) \in \mathit{Edg}^+(t)\}) \subseteq \{Y \mid Z \in \mathfrak{I}(S) \wedge Y \Rightarrow_E^+ Z\}$$

Since we supposed $\mathfrak{I}(S) \subseteq \mathit{Dn}(\tau)$, then clearly:

$$\{Y \to R \mid Z \in \mathfrak{I}(S) \wedge Y \Rightarrow_E^+ Z\} \subseteq \kappa$$

thus

$$\mathfrak{I}([\![\mathit{Axis}::\mathtt{node}]\!]_t(S)) \subseteq \mathit{Dn}(A_E(\tau,\mathit{Axis}) \cap \kappa)$$

which proves Property 1. The case for the "parent" axis is a particular
instance of "ancestor". As for Property 2, $\kappa$ is the set of rules used to
derive the context node type $\tau$, and $A_E(\kappa,\mathit{Axis})$ is the set of all the parent rules
(or ancestor rules) of the rules in $\kappa$. Consequently, the intersection is still a
well-formed context.

(test): Similarly to the case of Rule (down-axis), Property 1 is a direct application
of Lemma 5.2. For Property 2, we can remark that

$$\kappa' = \kappa \cap A_E(T_E(\tau,\mathit{Test}),\mathtt{ancestor})$$

contains all the rules leading to a node in $\tau$ for which $\mathit{Test}$ succeeds (including
the ones of the selected node), therefore it is a well-formed context.

Inductive case:

(predicate): Let us consider:

$$[\![\mathtt{self}::\mathtt{node}[\mathrm{Or}_j\,\mathrm{And}_k\,P_{jk}]]\!]_t(S)$$

By Definition 4.4 we have

$$[\![\mathtt{self}::\mathtt{node}[\mathrm{Or}_j\,\mathrm{And}_k\,P_{jk}]]\!]_t(S) = \bigcup_{i \in T} \{i\}$$

where $T$ is the set of ids satisfying the predicate:

$$T = \{i \mid i \in S \wedge \bigvee_j \bigwedge_k [\![P_{jk}]\!]_t(\{i\}) \neq \emptyset\}$$

Let us consider $i \in T$. We have that $i \in S$, and since $i$ is part of an $\mathfrak{I}$-valid
tree $t$, there exists $X_i = \mathfrak{I}(i)$ and an associated rule $X_i \to R_i$. We apply the
induction hypothesis on:

$$(\{X_i \to R_i\},\Sigma_{ctx}) \vdash_E P_{jk} : \Sigma^{ijk}_{typ}$$

(the output type). So

$$\mathfrak{I}([\![\mathtt{self}::\mathtt{node}[\mathrm{Or}_j\,\mathrm{And}_k\,P_{jk}]]\!]_t(S)) \subseteq \mathit{Dn}(\tau')$$

which proves Property 1. Property 2 holds by the same argument as in rule
(test).

(sequence): Property 1 is true by induction hypothesis on both premises. Property 2
is true for the first premise, by induction hypothesis. In particular, $\Sigma''_{ctx}$
is a well-formed context. We can then apply the induction hypothesis on $\Sigma''$
and we have that $\Sigma'_{ctx}$ is a well-formed context too.

(union): is similar to the previous case.



Lemma (Witness of a grammar (5.8)) Let $(\mathcal{S},E)$ be a non-recursive, $*$-guarded,
parent-unambiguous local tree grammar. There exists a document $t$, $\mathfrak{I}$-valid with respect
to $(\mathcal{S},E)$, such that:

$$\forall X \in \mathit{Dn}(E),\ \exists i \in \mathit{Ids}(t) \text{ such that } \mathfrak{I}(i) = X$$

We call such a document a witness of the schema $(\mathcal{S},E)$.

Proof 13 Since the tree grammar is non-recursive and parent-unambiguous, we can
prove the lemma by induction (I) on the height of the grammar, seen as a DAG.

Basic case: the grammar has height 1. It consists therefore of a single rule. The rule is
either $X \to \mathit{String}$, and a document "s" is a suitable witness; or the rule is $X \to a[\,]$
for some label $a$, and the document $a[\,]$ is a witness of the grammar.

Inductive case: Consider $(\{X\},E)$. The rule for the start symbol $X$ is $X \to a[r_1 \cdots r_n]$
for some label $a$ (since $E$ is $*$-guarded, the rule must have this shape). We show
by induction (II) on the structure of the regular expression $r_i$ that there is a witness
for this regular expression.

Basic case: Either $r_i = \epsilon$, and therefore the empty forest $()$ is a suitable witness.
Or $r_i = Z$. Then consider the grammar $(\{Z\},E')$ where
$E' = \{Y \to R \mid Y \to R \in E, Z \Rightarrow_E^* Y\}$, that is, the restriction of $E$ to $Z$. Then $\mathit{height}(E') < \mathit{height}(E)$,
since at least the rule associated with $X$ is not in $E'$ (and because
$E$ is non-recursive and parent-unambiguous). Therefore, by induction
hypothesis (I) there exists a witness $t_Z$ for $Z$.

Inductive case: Either $r_i = (r'_i \mid r''_i)^*$ and by induction hypothesis (II) there is a
witness $t'_i$ for $r'_i$ and $t''_i$ for $r''_i$. Then the forest $t'_i\,t''_i$ is a witness for $r_i$ (the
first iteration of $*$ matches $t'_i$ and the second one matches $t''_i$).
Or $r_i = (r'_i)^*$ and $r'_i$ is not a union. Then by induction hypothesis (II) there
is a witness $t'_i$ for $r'_i$, and $t'_i$ is also a witness for $r_i$. Or $r_i = r'_i\,r''_i$. By induction
hypothesis (II), there is a witness $t'_i$ for $r'_i$ and $t''_i$ for $r''_i$. Then the forest $t'_i\,t''_i$
is a witness for $r_i$.

Therefore, for each $r_i$ there is a witness $t_i$. Then the tree $a[t_1 \ldots t_n]$ is a witness of
the rule $X \to a[r_1 \ldots r_n]$.

□
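
The construction in this proof is directly executable. The Python sketch below follows the same case analysis under stated assumptions: content models are encoded as nested tuples (`"eps"`, `"name"`, `"seq"`, `"alt"`, `"star"`), a rule mapped to `None` stands for $X \to \mathit{String}$, a union is accepted only directly under a star (our reading of $*$-guardedness), and the grammar `E` together with the helpers `forest`, `witness` and `names` are hypothetical. The recursion terminates only because the grammar is non-recursive.

```python
# Grammar: name -> (label, content model), or None for a String rule.
E = {
    "Bib":    ("bib",    ("star", ("name", "Book"))),
    "Book":   ("book",   ("seq", ("name", "Title"),
                          ("star", ("alt", ("name", "Author"), ("name", "Editor"))))),
    "Title":  ("title",  ("name", "Text")),
    "Author": ("author", ("name", "Text")),
    "Editor": ("editor", ("eps",)),
    "Text":   None,
}

def forest(r, starred=False):
    """Witness forest for a regular expression (induction II of the proof)."""
    kind = r[0]
    if kind == "eps":
        return []
    if kind == "name":
        return [witness(r[1])]
    if kind == "seq":
        return forest(r[1]) + forest(r[2])
    if kind == "star":
        return forest(r[1], starred=True)       # a single iteration, unless a union follows
    if kind == "alt":
        if not starred:
            raise ValueError("union not guarded by a star")
        # one iteration of the star per alternative
        return forest(r[1], True) + forest(r[2], True)
    raise ValueError(kind)

def witness(name):
    """Witness tree (name, label, children) for the grammar rooted at `name` (induction I)."""
    rule = E[name]
    if rule is None:
        return (name, "s", [])
    label, regex = rule
    return (name, label, forest(regex))

def names(tree):
    """All grammar names that type some node of the tree."""
    n, _, kids = tree
    return {n}.union(*(names(k) for k in kids))
```

Calling `witness("Bib")` yields one `bib` element with a single `book` child containing a title, an author and an editor, so every name of the grammar types at least one node.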



COROLLARY (5.9) Let $(\{X\},E)$ be a non-recursive, $*$-guarded, parent-unambiguous
local tree grammar and $t$ be its witness. Let $\{Y_1, \ldots, Y_n\} \subseteq \mathit{Dn}(E)$. If $Y_1 \Rightarrow_E \cdots \Rightarrow_E Y_n$, then
there exists $\{i_1, \ldots, i_n\} \subseteq \mathit{Ids}(t)$ such that

$$\forall j \in \{2 \ldots n\},\ (i_{j-1},i_j) \in \mathit{Edg}(t) \wedge \mathfrak{I}(i_{j-1}) = Y_{j-1} \wedge \mathfrak{I}(i_j) = Y_j$$

Proof 14 This is a direct application of Lemma 5.8. We know that for all $Y \in \mathit{Dn}(E)$,
$\exists i \in \mathit{Ids}(t)$ such that $\mathfrak{I}(i) = Y$. This is true in particular for $\{Y_1, \ldots, Y_n\}$. Consider
$Y_j$ and $Y_{j+1}$. We have $Y_j \Rightarrow_E Y_{j+1}$, which means that in $E$ there is a rule $Y_j \to a[r_j]$
for some label $a$ and with $Y_{j+1} \in \mathit{Names}(r_j)$. Consequently, $t@i_j = a[\ldots t@i_{j+1} \ldots]$. Therefore,
$(i_j,i_{j+1}) \in \mathit{Edg}(t)$. □



Theorem (Completeness of type inference (5.10)) Let $(\mathcal{S},E)$ be a $*$-guarded,
non-recursive and parent-unambiguous local tree grammar, and $P$ a path. Let

$$E_0 = \{X \to R \mid X \to R \in E, X \in \mathcal{S}\}.$$

If $(E_0,E_0) \vdash_E P : (\tau,\kappa)$ then:

$$\mathit{Dn}(\tau) \subseteq \bigcup_{t \in_{\mathfrak{I}} E} \mathfrak{I}([\![P]\!]_t(\mathit{RootId}(t)))$$

Proof 15 As in the proof of Theorem 5.6, we consider the following, more general
judgment:

$$(\tau,\kappa) \vdash_E P : (\tau',\kappa')$$

Let $t$ be the witness of $E$. We show that if $\mathit{Dn}(\tau) \subseteq \mathfrak{I}(S)$ then $\mathit{Dn}(\tau') \subseteq \mathfrak{I}([\![P]\!]_t(S))$. If
this holds for the witness $t$ then it holds for the union of all trees $\mathfrak{I}$-valid w.r.t. $E$ (which
contains $t$). Informally, this means that if the type $\tau$ "describes precisely" the nodes in
$S$, that is, if there are no unneeded rules in $\tau$, then the type $\tau'$ describes exactly the result
of the query: for each rule in $\tau'$, there is a node in the result of the query typed by that
rule. We proceed by induction on the depth of the typing derivation:

Basic case:

(down-axis): self axis: We supposed $\mathit{Dn}(\tau) \subseteq \mathfrak{I}(S)$. We have $\tau' = A_E(\tau,\mathtt{self}) = \tau$.
We also have:

$$[\![\mathtt{self}::\mathtt{node}]\!]_t(S) = S$$

by Definition 4.3. Therefore, $\mathit{Dn}(\tau') \subseteq \mathfrak{I}(S)$ and so

$$\mathit{Dn}(\tau') \subseteq \mathfrak{I}([\![\mathtt{self}::\mathtt{node}]\!]_t(S))$$

descendant axis: Let us consider names $X \in \mathit{Dn}(\tau)$ and
$Y \in A_E(\{X\},\mathtt{descendant})$. By Definition 5.7 we have that $X \Rightarrow_E^* Y$. By using
Corollary 5.9 we have that there exists a sequence $i_1, \ldots, i_n$ in $t$ such that
$X = \mathfrak{I}(i_1)$ and $Y = \mathfrak{I}(i_n)$. We also have that

$$\forall j \in \{1 \ldots n-1\},\ (i_j,i_{j+1}) \in \mathit{Edg}(t)$$

thus $(i_1,i_n) \in \mathit{Edg}^+(t)$ and therefore

$$i_n \in [\![\mathtt{descendant}::\mathtt{node}]\!]_t(\{i_1\})$$

Subsequently:

$$A_E(\{X\},\mathtt{descendant}) \subseteq \mathfrak{I}([\![\mathtt{descendant}::\mathtt{node}]\!]_t(\{i_1\}))$$

child axis: is a particular instance of the previous case.

(up-axis) We only treat the case of the ancestor axis, of which the parent axis is
a particular instance. This case is symmetric to that of the descendant axis. Let
$X \in \mathit{Dn}(\tau)$. Let $Y \in A_E(\{X\},\mathtt{ancestor}) \cap \kappa$. By Definition 5.7 we have that
$Y \Rightarrow_E^* X$ (and because $\kappa$ is a well-formed context). By using Corollary 5.9 we
have that there exists a sequence $i_1, \ldots, i_n$ in $t$ such that $Y = \mathfrak{I}(i_1)$ and $X = \mathfrak{I}(i_n)$.
We also have

$$\forall j \in \{1 \ldots n-1\},\ (i_j,i_{j+1}) \in \mathit{Edg}(t)$$

thus $(i_1,i_n) \in \mathit{Edg}^+(t)$ and therefore

$$i_1 \in [\![\mathtt{ancestor}::\mathtt{node}]\!]_t(\{i_n\})$$

Thus, we have

$$A_E(\{X\},\mathtt{ancestor}) \subseteq \mathfrak{I}([\![\mathtt{ancestor}::\mathtt{node}]\!]_t(\{i_n\}))$$

We must also show that if $Y \in \kappa$ then

$$Y \in \mathfrak{I}([\![\mathtt{ancestor}::\mathtt{node}]\!]_t(\{i_n\}))$$

(because the output type is intersected with the context for this rule). This is an
immediate consequence of the well-formedness of contexts: $\kappa$ is well-formed only
if $\kappa \subseteq \tau \cup A_E(\tau,\mathtt{ancestor})$.

(test): is similar to the case self of Rule (down-axis).

Inductive case:

(predicate) Suppose $X_i \in \mathit{Dn}(\tau)$ (there is a unique rule with $X_i$ as left-hand side, since
we consider a local tree grammar). We consider the premise:

$$(\{X_i \to R_i\},\Sigma_{ctx}) \vdash_E P_{jk} : \Sigma^{ijk}$$

Let us consider a set $S_i \subseteq S$. By induction hypothesis, if $\{X_i\} \subseteq \mathfrak{I}(S_i)$ then
$\Sigma^{ijk}_{typ} \subseteq \mathfrak{I}([\![P_{jk}]\!]_t(S_i))$. There are two cases.

Either $\bigcup_{j,k} \Sigma^{ijk}_{typ} = \emptyset$. Then $X_i \notin \mathit{Dn}(\tau')$ (the output type). But if this is the case,
because the type system is sound (cf. Theorem 5.6), then:
$\forall i'$ such that $\mathfrak{I}(i') = X_i$, $[\![P_{jk}]\!]_t(\{i'\}) = \emptyset$ and therefore

$$\bigvee_j \bigwedge_k [\![P_{jk}]\!]_t(\{i'\}) = \emptyset$$

Or $\bigcup_{j,k} \Sigma^{ijk}_{typ} \neq \emptyset$ and, since $\Sigma^{ijk}_{typ} \subseteq \mathfrak{I}([\![P_{jk}]\!]_t(S_i))$, this means that $[\![P_{jk}]\!]_t(S_i) \neq \emptyset$
and therefore that

$$S_i \subseteq [\![\mathtt{self}::\mathtt{node}[\mathit{Cond}]]\!]_t(S)$$

Since $X_i \in \mathfrak{I}(S_i)$, $X_i \in \mathfrak{I}([\![\mathtt{self}::\mathtt{node}[\mathit{Cond}]]\!]_t(S))$. Lastly, we remark that for
each $X_i$ the set $S_i$ is not empty. This is a consequence of Lemma 5.8: for each name
$X_i$, there is a node typed by $X_i$ in the witness.

(sequence): By applying straightforwardly the induction hypothesis on the premises.

(union): By applying straightforwardly the induction hypothesis on the premises.

□



Lemma (Termination of type-projector inference (5.12)) Let $(\mathcal{S},E)$ be a
type, $P$ a path, and $\Sigma$ and $\Sigma'$ environments. The judgment $\Sigma \Vdash_E P : \Sigma'$ has a unique and
finite derivation.

Proof 16 The uniqueness of the derivation follows from the fact that all the rules are
mutually exclusive (although not strictly syntax-directed) thanks to their side conditions.

To prove termination, we need more care than for the type inference algorithm.
To the judgment

$$\Sigma \Vdash_E P : \Sigma'$$

we assign as weight the triple $(l(P), r(P), |\Sigma_{typ}|)$, ordered lexicographically, where:

$l(P)$ is the length of the path $P$, as defined previously;

$r(P)$ is the number of occurrences of a recursive step, that is, the number of occurrences
of descendant::node or ancestor::node in $P$;

$|\Sigma_{typ}|$ is the number of rules in the input type.

The proof is straightforward and consists in showing that, for every rule, the weight strictly decreases
in the premises:

Basic case: the base of the induction is an application of (p-step) or (p-erase), which do not have
any premises.

Inductive case: For the rules (p-union), (p-test) and (p-predicate), the weight strictly
decreases in $l(P)$ in the premises. For the rule (p-iterate), $|\Sigma_{typ}|$ strictly decreases
in the premises, since in the conclusion this component is at least two
and it is exactly one in each of the premises. Also, $P$ is unchanged in the premises,
therefore $l(P)$ and $r(P)$ do not increase. For the rule (p-many), the $l(P)$ part is
unchanged in the premises, since $l(\mathtt{descendant}::\mathtt{node}/P) = l(\mathtt{child}::\mathtt{node}/P) = 1 + l(P)$,
and $r(P)$ decreases strictly.

□
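
The lexicographic weight used in this proof can be illustrated with Python tuples, which are compared lexicographically. This is a simplified sketch under stated assumptions: paths are predicate-free and given as lists of (axis, test) pairs, so that $l(P)$ is just the number of steps, and the function `weight` and the sample values are hypothetical.

```python
RECURSIVE_STEPS = {("descendant", "node"), ("ancestor", "node")}

def weight(path, sigma_typ):
    """(l(P), r(P), |Sigma_typ|) for a predicate-free path given as a list of steps."""
    l = len(path)
    r = sum(1 for step in path if step in RECURSIVE_STEPS)
    return (l, r, len(sigma_typ))

# (p-many): descendant::node/P in the conclusion, child::node/P in a premise.
goal    = weight([("descendant", "node"), ("child", "title")], {"Y"})
premise = weight([("child", "node"), ("child", "title")], {"X1", "X2", "X3"})

# (p-iterate): same path, a type with three rules split into singletons.
iter_goal    = weight([("child", "title")], {"X1", "X2", "X3"})
iter_premise = weight([("child", "title")], {"X1"})
```

In the (p-many) case the premise may carry a larger input type, yet its weight is still smaller because the second component $r(P)$ drops from 1 to 0 while $l(P)$ is unchanged; in the (p-iterate) case only the third component decreases.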
Lemma (Well-formedness of type-projector inference (5.13)) Let $(\mathcal{S},E)$
be a type, $\tau$, $\tau'$, and $\kappa$ sets of rules, and $P$ a path. If $(\tau,\kappa) \Vdash_E P : \tau'$, then
$(\tau,\kappa) \vdash_E P : (\tau'',\kappa'')$ implies $\kappa'' \subseteq \tau'$.

Proof 17 We use a structural induction on the derivation of $(\tau,\kappa) \Vdash_E P : \tau'$, which is
finite by Lemma 5.12.

Basic case: The property is trivially true for the rule (p-step), since the result is the union
of the output type and its associated context.

Rule (p-erase) can only be applied if the side conditions of the other rules fail,
that is, in the case where the judgment $(\tau,\kappa) \vdash_E P : (\tau'',\kappa'')$ does not hold.
Therefore the lemma is true in that case too.

Inductive case:

(p-union) We suppose $(\tau,\kappa) \vdash_E P_1|P_2 : (\tau'',\kappa'')$. This means that the typing rule (union)
holds and that:

$$(\tau,\kappa) \vdash_E P_1 : (\tau''_1,\kappa''_1)$$

and

$$(\tau,\kappa) \vdash_E P_2 : (\tau''_2,\kappa''_2)$$

Let $P_1$ produce a type-projector $\tau'_1$ and $P_2$ produce $\tau'_2$; by induction hypothesis
$\kappa''_1 \subseteq \tau'_1$ and $\kappa''_2 \subseteq \tau'_2$. But since $\kappa'' = \kappa''_1 \cup \kappa''_2$, we have $\kappa'' \subseteq \tau'_1 \cup \tau'_2 = \tau'$.

(p-iterate) similar to the previous case.

(p-test) we suppose

$$(\{Y \to R\},\kappa) \vdash_E \mathtt{self}::\mathit{Test}/P : (\tau'',\kappa'')$$

According to the (test) typing rule, this means that:

$$(\{Y \to R\},\kappa) \vdash_E \mathtt{self}::\mathit{Test} : (\tau_1,\kappa_1)$$

and

$$(\tau_1,\kappa_1) \vdash_E P : (\tau'',\kappa'')$$

The induction hypothesis can be applied on the second premise of the rule (p-test),
and we have $(\tau_1,\kappa_1) \Vdash_E P : \tau'$ with $\kappa'' \subseteq \tau'$. Since $\tau' \subseteq \{Y \to R\} \cup \tau'$, we have
$\kappa'' \subseteq \{Y \to R\} \cup \tau'$, which proves this case.

(p-predicate) similar to the previous case; we can observe that the context resulting from
the typing of the first step is passed as argument for the inference of the projector
of the remainder of the path.

(p-single) similar to the previous case.

(p-many) we only treat the case $\mathit{Axis} = \mathtt{descendant}$, the case for ancestor being
similar. We suppose:

$$(\{Y \to R\},\kappa) \vdash_E \mathtt{descendant}::\mathtt{node}/P : (\tau_0,\kappa_0) \quad (*)$$

and we want to show that $\kappa_0 \subseteq \{Y \to R\} \cup \tau' \cup \tau''$. Let us write:

$$(\{Y \to R\},\kappa) \vdash_E \mathtt{descendant}::\mathtt{node} : (\tau_1,\kappa_1) \quad (1)$$

$$(\{Y \to R\} \cup \tau_1,\kappa) \vdash_E \mathtt{child}::\mathtt{node} : (\tau_2,\kappa_2) \quad (2)$$

with

$$\tau_1 = A_E(\{Y \to R\},\mathtt{descendant}) = \{Y' \to R' \mid Y \Rightarrow_E^+ Y'\}$$

$$\kappa_1 = \tau_1 \cup \kappa$$

$$\tau_2 = A_E(\{Y \to R\} \cup \tau_1,\mathtt{child}) = \bigcup \{Y'' \to R'' \mid Y' \Rightarrow_E Y''\}$$

$$\kappa_2 = \tau_2 \cup \kappa$$

By transitivity of $\Rightarrow_E$, (2) becomes:

$$\tau_2 = \bigcup_{\{Y' \to R' \mid Y \Rightarrow_E^* Y'\}} \{Y'' \to R'' \mid Y' \Rightarrow_E Y''\}$$

$$\tau_2 = \{Y' \to R' \mid Y \Rightarrow_E^+ Y'\} = \tau_1$$

$$\kappa_2 = \tau_2 \cup \kappa = \tau_1 \cup \kappa = \kappa_1$$

therefore, for the path $P$,

$$(\{Y \to R\} \cup \tau_1,\kappa) \vdash_E \mathtt{child}::\mathtt{node}/P : (\tau_0,\kappa_0)$$

by application of the (sequence) typing rule. We can finally remark that

$$((\{Y \to R\} \cup \tau_1) \cap \tau',\kappa) \vdash_E \mathtt{child}::\mathtt{node}/P : (\tau_0,\kappa_0)$$

as the rules in $\tau_1 \setminus \tau'$ yield an empty projector. Therefore

$$(\tau',\kappa) \vdash_E \mathtt{child}::\mathtt{node}/P : (\tau_0,\kappa_0)$$

which allows us to apply the induction hypothesis on the third premise of (p-many),
which gives us $\kappa_0 \subseteq \tau''$, and therefore $\kappa_0 \subseteq \{Y \to R\} \cup \tau' \cup \tau''$.

□



Theorem (Soundness of type-projector inference (15.141) ) Let (y,E) be a 
type and P an XPath^ query. Let S be the set of rules: S={X^R\XE ,5^}. If 

{S,S)h E P:x 

then X is a type-projector for (,y,E) and for every t £3 (J^Zs) we have: 
lPl XdT (RootId(t)) = \PURootId{t)) 



Proof 18 By simple structural induction on the path. □ 
Theorem (Completeness of projector inference (15.161)) Let $(\mathcal{S}, E)$ be a 
$*$-guarded, non-recursive, and parent-unambiguous local tree grammar, and $P$ a 
strongly-specified XPath$^{\ell}$ path. Let $\Sigma$ be the set of rules $\Sigma = \{X \to R \mid X \in \mathcal{S}\}$. If 

\[ (\Sigma, \Sigma) \Vdash_E P : \tau \]

then there exists $t \in_{\Im} (\mathcal{S}, E)$ such that for each $Y \to R \in \tau$, if 
$\tau' = \tau \setminus (\{Y \to R\} \cup A_E(\{Y \to R\}, \texttt{descendant}))$, then: 

\[ [\![P]\!]_{t \setminus_{\Im} \tau'}(\mathrm{RootId}(t)) \neq [\![P]\!]_{t}(\mathrm{RootId}(t)) \]



Proof 19 By induction on the length of the typing derivation, which is finite. We use 
Theorem I5.70l to show that if we remove a name Y inferred by the type inference algorithm, 
then we remove nodes from the result of the query applied to the projected document. 
The fact that P is strongly specified is used for the treatment of predicates. Indeed, it 
forces any path in a predicate to be matched by exactly one node. If a path in a predicate 
could be matched by two (or more) nodes, then removing one of the nodes would not 
change the semantics of the query, since there would still be a node present to make the 
predicate succeed. We illustrate this in the example hereafter. □ 
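
The remark on predicates can be made concrete on a small document. The sketch below is only an illustration under stated assumptions: the document, the element names and the helper `select_a_with_b` are hypothetical, and the query `a[b]` is evaluated with the limited XPath support of Python's `xml.etree.ElementTree`. When the predicate `[b]` has two witnesses, removing one leaves the result unchanged; when it has exactly one, removing it changes the result.

```python
import xml.etree.ElementTree as ET


def select_a_with_b(root):
    """Evaluate a[b] from the root: ids of <a> children having at least one <b> child."""
    return [a.get("id") for a in root.findall("a[b]")]


# Hypothetical document: [b] has two witnesses under the first <a>, one under the second.
doc = ET.fromstring('<r><a id="1"><b/><b/></a><a id="2"><b/></a></r>')
before = select_a_with_b(doc)

# Remove one of the two <b> witnesses under the first <a>.
first_a = doc.findall("a")[0]
first_a.remove(first_a.find("b"))
after_one = select_a_with_b(doc)

# Remove the only <b> witness under the second <a>.
second_a = doc.findall("a")[1]
second_a.remove(second_a.find("b"))
after_two = select_a_with_b(doc)
```

Here `before` and `after_one` are equal, while `after_two` no longer contains the second element, which is the situation strong specification is meant to enforce for every predicate path.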



B Text of the XMark and XPathMark queries 
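Base and induction 

\[ \textsf{(p-step)}\quad \frac{\Sigma \vdash_E \mathit{Step} : (\tau, \kappa)}{\Sigma \Vdash_E \mathit{Step} : \tau \cup \kappa}\ \ \text{if } \tau \neq \emptyset \qquad\qquad \textsf{(p-union)}\quad \frac{\Sigma \Vdash_E P_1 : \tau_1 \qquad \Sigma \Vdash_E P_2 : \tau_2}{\Sigma \Vdash_E P_1 | P_2 : \tau_1 \cup \tau_2} \]

\[ \textsf{(p-erase)}\quad \frac{}{\Sigma \Vdash_E P : \emptyset}\ \ \text{if no other rule applies} \]

\[ \textsf{(p-iterate)}\quad \frac{(\{X_1 \to R_1\}, \kappa) \Vdash_E P : \tau_1 \quad \ldots \quad (\{X_n \to R_n\}, \kappa) \Vdash_E P : \tau_n}{(\{X_1 \to R_1, \ldots, X_n \to R_n\}, \kappa) \Vdash_E P : \bigcup_{i=1..n} \tau_i}\ \ \text{if } n \geq 2 \]

Path Rules 

\[ \textsf{(p-test)}\quad \frac{(\{Y \to R\}, \kappa) \vdash_E \texttt{self::}\mathit{Test} : \Sigma \qquad \Sigma \Vdash_E P : \tau}{(\{Y \to R\}, \kappa) \Vdash_E \texttt{self::}\mathit{Test}/P : \{Y \to R\} \cup \tau}\ \ \text{if } \Sigma_{typ} \neq \emptyset \]

\[ \textsf{(p-predicate)}\quad \frac{\Sigma \Vdash_E P : \tau \qquad (\{Y \to R\}, \kappa) \vdash_E \texttt{self::node}[\mathrm{Or}_i\,\mathrm{And}_j\,P_{ij}] : \Sigma \qquad \Sigma \Vdash_E P_{ij} : \tau_{ij}}{(\{Y \to R\}, \kappa) \Vdash_E \texttt{self::node}[\mathrm{Or}_i\,\mathrm{And}_j\,P_{ij}]/P : \{Y \to R\} \cup \tau \cup \bigcup_i \bigcup_j \tau_{ij}}\ \ \text{if } \Sigma_{typ} \neq \emptyset \]

\[ \textsf{(p-single)}\quad \frac{(\{Y \to R\}, \kappa) \vdash_E \mathit{Axis}\texttt{::node} : (\tau, \kappa') \qquad (\text{for } i = 1..n)\ (\{X_i \to R_i\}, \kappa') \vdash_E P : \Sigma^i \qquad (\tau', \kappa') \Vdash_E P : \tau''}{(\{Y \to R\}, \kappa) \Vdash_E \mathit{Axis}\texttt{::node}/P : \{Y \to R\} \cup \tau' \cup \tau''}\ \ (*) \]

(*) where $\mathit{Axis} \in \{\texttt{parent}, \texttt{child}\}$, $\tau = \{X_1 \to R_1, \ldots, X_n \to R_n\}$, 
$\tau' = \{X_i \to R_i \mid i = 1..n,\ \Sigma^i_{typ} \neq \emptyset\}$, $\tau \neq \emptyset$, and $\tau' \neq \emptyset$. 

\[ \textsf{(p-many)}\quad \frac{(\{Y \to R\}, \kappa) \vdash_E \mathit{Axis}\texttt{::node} : (\tau, \kappa') \qquad (\text{for } i = 1..n)\ (\{X_i \to R_i\}, \kappa') \vdash_E \mathit{Axis}\texttt{::node}/P : \Sigma^i \qquad (\tau', \kappa') \Vdash_E s(\mathit{Axis})\texttt{::node}/P : \tau''}{(\{Y \to R\}, \kappa) \Vdash_E \mathit{Axis}\texttt{::node}/P : \{Y \to R\} \cup \tau' \cup \tau''}\ \ (**) \]

(**) where $\mathit{Axis} \in \{\texttt{ancestor}, \texttt{descendant}\}$, $\tau = \{X_1 \to R_1, \ldots, X_n \to R_n\}$, 
$\tau' = \{Y \to R\} \cup \{X_i \to R_i \mid i = 1..n,\ \Sigma^i_{typ} \neq \emptyset\}$, $\tau \neq \emptyset$, $\tau' \neq \emptyset$, 
$s(\texttt{descendant}) = \texttt{child}$, and $s(\texttt{ancestor}) = \texttt{parent}$. 

Figure 2: Inference rules for type-projectors 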
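\[
\begin{array}{lcl}
F(\texttt{count}(Q)) & = & P(Q) \\
F(\texttt{last}(Q)) & = & P(Q) \\
F(\texttt{position}(Q)) & = & P(Q) \\
F(\texttt{string}(Q)) & = & P(Q)/\texttt{descendant-or-self::node[text()]} \\
F(\texttt{number}(Q)) & = & P(Q)/\texttt{descendant-or-self::node[text()]} \\
F(\texttt{not}(Q)) & = & P(Q) \\
F(\texttt{true}()) & = & \{\texttt{self::node}\} \\
F(\texttt{false}()) & = & \{\texttt{self::node[self::a and self::b]}\} \\
F(f(v_1, \ldots, v_n)) & = & \{\texttt{self::node}\} \quad \text{where } v_i \text{ is a value}
\end{array}
\]

Figure 3: Approximation of XPath functions 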
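
The approximation of Figure 3 amounts to a case analysis on the function name. The following is a minimal sketch, assuming that P(Q) is available as an already computed set of path strings (passed as `arg_paths`); the function name `approximate` and the string encoding of paths are hypothetical and chosen only for illustration.

```python
DOS_TEXT = "/descendant-or-self::node[text()]"


def approximate(func, arg_paths=()):
    """Return a set of path strings approximating a call to the XPath function `func`.

    `arg_paths` stands for P(Q), the paths of the argument (assumed precomputed).
    """
    if func in ("count", "last", "position", "not"):
        # Only the nodes selected by the argument are needed.
        return set(arg_paths)
    if func in ("string", "number"):
        # The textual content below the selected nodes is needed as well.
        return {p + DOS_TEXT for p in arg_paths}
    if func == "true":
        return {"self::node"}
    if func == "false":
        # A predicate that can never hold: a node cannot be both <a> and <b>.
        return {"self::node[self::a and self::b]"}
    # Any other function, applied to values only.
    return {"self::node"}


string_paths = approximate("string", {"person/name"})
count_paths = approximate("count", {"bidder"})
```

With this encoding, `count_paths` is just the argument path, while `string_paths` extends it with the step that keeps text descendants.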



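\[
\begin{array}{rlcl}
1. & E((), \Gamma, m) & = & \emptyset \\
2. & E(v, \Gamma, m) & = & \emptyset \\
3. & E((q_1, q_2), \Gamma, m) & = & E(q_1, \Gamma, m) \cup E(q_2, \Gamma, m) \\
4. & E(\texttt{<tag>}q\texttt{</tag>}, \Gamma, m) & = & E(q, \Gamma, m) \\
5. & E(x, \Gamma, 1) & = & \bigcup_{(x;P) \in \Gamma} \{P/\texttt{descendant-or-self::node}\} \\
6. & E(x, \Gamma, 0) & = & \bigcup_{(x;P) \in \Gamma} \{P\} \\
7. & E(\mathit{Path}, \Gamma, 1) & = & \{\mathit{Path}/\texttt{descendant-or-self::node}\} \\
8. & E(\mathit{Path}, \Gamma, 0) & = & \{\mathit{Path}\} \\
9. & E(\mathit{FLOWR}/P, \Gamma, m) & = & E(\mathit{FLOWR}, \Gamma, m)/E(P, \Gamma, m) \\
10. & E(\mathit{Step}/P, \Gamma, m) & = & \mathit{Step}/E(P, \Gamma, m) \\
11. & E(\mathit{Step}[\mathit{Cond}]/P, \Gamma, m) & = & \big(\bigcup_{q} \mathit{Step}[q]\big)/E(P, \Gamma, m) \\
12. & E(\texttt{if } q \texttt{ then } q_1 \texttt{ else } q_2, \Gamma, m) & = & E(q, \Gamma, 0) \cup E(q_1, \Gamma, m) \cup E(q_2, \Gamma, m) \\
13. & E(\texttt{let } \$x := q_1 \texttt{ return } q_2, \Gamma, m) & = & E(q_2, \Gamma \cup \Gamma', m) \quad \text{where } \Gamma' = \{(x;P) \mid P \in E(q_1, \Gamma, 0)\} \\
14. & E(\texttt{for } \$x \texttt{ in } q_1 \texttt{ return } q_2, \Gamma, m) & = & E(q_1, \Gamma, 0) \cup E(q_2, \Gamma \cup \Gamma', m) \quad \text{where } \Gamma' = \{(x;P) \mid P \in E(q_1, \Gamma, 0)\} \\[1ex]
1'. & E'(\mathit{Cond}_1\ op\ \mathit{Cond}_2, \Gamma, m) & = & \bigcup_{q \in E'(\mathit{Cond}_1, \Gamma, m)} \bigcup_{q' \in E'(\mathit{Cond}_2, \Gamma, m)} \{q\ op\ q'\} \quad \text{where } op \in \{\texttt{and}, \texttt{or}\} \\
2'. & E'(\mathit{Expr}_1\ cmp\ \mathit{Expr}_2, \Gamma, m) & = & \bigcup_{q \in E'(\mathit{Expr}_1, \Gamma, m)} \bigcup_{q' \in E'(\mathit{Expr}_2, \Gamma, m)} \{q\ cmp\ q'\} \quad \text{where } cmp \in \{=, !=, <, >, >=, <=\} \\
3'. & E'(\mathit{Arith}_1\ op\ \mathit{Arith}_2, \Gamma, m) & = & \bigcup_{q \in E'(\mathit{Arith}_1, \Gamma, m)} \bigcup_{q' \in E'(\mathit{Arith}_2, \Gamma, m)} \{q\ op\ q'\} \quad \text{where } op \in \{+, -, *, \texttt{div}, \texttt{mod}\} \\
4'. & E'(f(\mathit{Expr}_1, \ldots, \mathit{Expr}_n), \Gamma, m) & = & \bigcup_{q_1 \in E'(\mathit{Expr}_1, \Gamma, M(f,1))} \cdots \bigcup_{q_n \in E'(\mathit{Expr}_n, \Gamma, M(f,n))} \{f(q_1, \ldots, q_n)\} \\
5'. & E'(\mathit{Atom}, \Gamma, m) & = & E(\mathit{Atom}, \Gamma, m) \quad \mathit{Atom} \neq f(q_1, \ldots, q_n)
\end{array}
\]

Figure 4: XQuery path extraction 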
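
The path extraction of Figure 4 can be read as a recursive traversal of the query that threads the variable environment and the flag m. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical encoding of queries as nested tuples and of paths as strings, and it covers only rules 1 to 8 and 12 to 14 (no steps, conditions or function calls).

```python
DOS = "/descendant-or-self::node"


def extract(q, gamma, m):
    """Return the set of path strings extracted from query `q`.

    `gamma` is a set of (variable, path) pairs; `m` is 1 when the whole
    subtree of the selected nodes is needed and 0 otherwise.
    """
    kind = q[0]
    if kind in ("empty", "value"):            # rules 1 and 2
        return set()
    if kind == "seq":                         # rule 3
        return extract(q[1], gamma, m) | extract(q[2], gamma, m)
    if kind == "elem":                        # rule 4
        return extract(q[1], gamma, m)
    if kind == "var":                         # rules 5 and 6
        paths = {p for (x, p) in gamma if x == q[1]}
        return {p + DOS for p in paths} if m == 1 else paths
    if kind == "path":                        # rules 7 and 8
        return {q[1] + DOS} if m == 1 else {q[1]}
    if kind == "if":                          # rule 12
        return (extract(q[1], gamma, 0)
                | extract(q[2], gamma, m)
                | extract(q[3], gamma, m))
    if kind in ("let", "for"):                # rules 13 and 14
        _, x, q1, q2 = q
        bound = extract(q1, gamma, 0)
        body = extract(q2, gamma | {(x, p) for p in bound}, m)
        return body if kind == "let" else bound | body
    raise ValueError("unsupported query form: %r" % (kind,))


# for $x in /site/people/person return $x, with the result materialised (m = 1)
query = ("for", "x", ("path", "/site/people/person"), ("var", "x"))
paths = extract(query, frozenset(), 1)
```

For this query the sketch returns the binding path itself (needed to iterate) together with the same path extended by `descendant-or-self::node` (needed to return the bound nodes in full).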
M(count, 1) = M(string, 1) = 1 
M(last, 1) = M(number, 1) = 1 

M(position, 1) = M(not, 1) = 

Figure 5: Value of the parameter m for various built-in XPath functions 
[Bar charts: one panel over the XMark queries (M01 to M20) and one panel over the XPathMark queries (A1 to E7). The plotted values are not recoverable.] 





       A1      A6      B1      B2      C3      C4      D1      D2      E5      E7      M3      M6      M7      M14     M15
(i)    3363    3363    3363    3363    3363    3363    3363    3363    3363    2242    3363    3363    3363    2242    3363
(ii)   447     67.6    50.5    571     68.3    137     65.5    297.25  60.8    605     92.8    9       121     605     67.6
(iii)  10      9       3       113     9       23      13      59      12      190     18      2       24      190     65
(iv)   20      17      22      2.9     17      5.8     12.1    5.6     12.3    2.1     9       15.8    14.3    2.19    7.5
(v)    3.7     5.5     2.2     22.2    3.7     8       5.4     9.7     7.44    22      9       1.8     4.9     29      15.2

(i): Largest queryable document (MB). We stopped our testing at 3363 MB 

(ii): Pruned size (MB) 

(iii): Pruned size for 671 MB (MB) 

(iv): Speed-up (× faster) 

(v): Memory use in % of original 



[Two plots, per query: query answering time (s) and memory consumption (MB).] 



Figure 7: Experimental results for the Saxon-b/XQuery engine 





       A1      A6      B1      B2      C3      C4      D1      D2      E5      E7      M3      M6      M7      M14     M15
(i)    675     119     95      843     117     232     93      443     116     1073    138     14      111     1073    436
(ii)   0.5     0.5     0.3     24.8    0.5     45.4    10.0    100.0   -       59      1.1     84.7    50.8    54.0    94.0
(iii)  6.8     10.5    8.5     3.3     18.1    2.1     21.6    1.0     -       1.7     5.0     2.8     1.7     2.06    1.0

(i): Size of the index (MB) 

(ii): Amount of I/O (% of the original) 

(iii): Speed-up (× faster) 

[Plot: query answering time (s) per query, for the original and the pruned documents.] 



Figure 8: Experimental results for the MonetDB engine 
A1   /site/closed_auctions/closed_auction/annotation/description/text/keyword 

A6   /site/people/person[profile/gender and profile/age]/name 

B1   /site/regions/*/item[parent::namerica or parent::samerica]/name 

B2   //keyword/ancestor::listitem/text/keyword 

C3   /site/people/person[profile/@income = 
         /site/open_auctions/open_auction/current]/name 

C4   /site/people/person[watches/watch/id(@open_auction)/seller/@person = @id]/name 

D1   /site/open_auctions/open_auction[(count(bidder) mod 2) = 0]/interval 

D2   count(//text) + count(//bold) + count(//emph) + count(//keyword) 

E5   /site/regions/*/item[preceding::item[100] and following::item[100]]/name 

E7   /site/regions/*/item[contains(substring-before(description, 'eros'), 'passion') 
         and contains(substring-after(description, 'eros'), 'dangerous')]/name 

M3   for $b in $doc/site/open_auctions/open_auction 
     where zero-or-one($b/bidder[1]/increase/text()) * 2 
           <= $b/bidder[last()]/increase/text() 
     return 
       <increase 
         first="$b/bidder[1]/increase/text()" 
         last="$b/bidder[last()]/increase/text()"/> 

M6   for $b in $doc//site/regions return count($b//item) 

M7   for $p in $doc/site 
     return 
       count($p//description) + count($p//annotation) + count($p//emailaddress) 

M14  for $i in $doc/site//item 
     where contains(string(exactly-one($i/description)), "gold") 
     return $i/name/text 

M15  for $a in 
       $doc/site/closed_auctions/closed_auction/annotation/description/parlist/ 
       listitem/parlist/listitem/text/emph/keyword/text() 
     return <text>$a</text> 

Figure 9: XPathMark (A-E) and XMark (M) queries 
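
To relate these queries to the basic idea of projection, the sketch below prunes a tiny document for query A1 and checks that the query returns the same answer on the pruned document. It is a simplified, instance-level illustration and not the type-based projector inference of the paper: the sample document and the helper `prune` are hypothetical, and pruning simply follows the child steps of the path, keeping the full subtree of the final matches.

```python
import copy
import xml.etree.ElementTree as ET

# Child steps of query A1, relative to the <site> root element.
A1_STEPS = ["closed_auctions", "closed_auction", "annotation",
            "description", "text", "keyword"]
A1_PATH = "/".join(A1_STEPS)


def prune(elem, steps):
    """Drop every child that is not on the child path `steps` (in place)."""
    if not steps:
        return  # elem is a result node: keep its whole subtree
    for child in list(elem):
        if child.tag == steps[0]:
            prune(child, steps[1:])
        else:
            elem.remove(child)


# Hypothetical miniature of an XMark document.
doc = ET.fromstring(
    "<site>"
    "<closed_auctions><closed_auction>"
    "<annotation><description><text>gold <keyword>ring</keyword></text>"
    "</description></annotation><price>10</price>"
    "</closed_auction></closed_auctions>"
    "<people><person><name>Ann</name></person></people>"
    "</site>"
)

pruned = copy.deepcopy(doc)
prune(pruned, A1_STEPS)

original_answer = [k.text for k in doc.findall(A1_PATH)]
pruned_answer = [k.text for k in pruned.findall(A1_PATH)]
original_size = len(list(doc.iter()))
pruned_size = len(list(pruned.iter()))
```

The answers coincide while the pruned tree holds fewer elements than the original, which is the effect measured at scale in the experiments above.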